Pulp Engine Document Rendering
Release v0.65.0

Date: 2026-04-11

Theme

Dev Team Blockers — Phase A (observability & reliability) + Phase B (template lifecycle & quality), end-to-end.

v0.65.0 bundles the two themed roadmap phases that took Pulp Engine from “single-tenant homelab” to “multi-env SaaS-ready” on the lifecycle axis. The 20-person dev team that was evaluating Pulp Engine and churning on ten specific gaps gets six of them resolved in this release (per-template metrics #3, webhook DLQ #6, request prioritization #8, staging/prod labels #1, A/B variants #2, template testing #10), plus the editor UI that lets operators drive label promotion from the tool they’re already in (B.1a).

Phase C (tenancy/billing/fairness) and Phase D (developer ergonomics) are not in this release — Phase C is gated on an architectural decision about the tenant primitive, Phase D is the lowest priority per the plan.

At a glance

  • A.1 Metrics — per-template cardinality-bounded label on render histograms. New surface: METRICS_TEMPLATE_LABEL_MODE, METRICS_TEMPLATE_LABEL_MAX, METRICS_TEMPLATE_LABEL_ALLOWLIST
  • A.2 Prioritization — reserved preview lane in browser pool. New surface: RENDER_PREVIEW_RESERVED_SLOTS
  • A.3 DLQ — Postgres schedule-delivery DLQ + admin replay/abandon. New surface: SCHEDULE_DLQ_ENABLED, SCHEDULE_DLQ_RETENTION_DAYS, SCHEDULE_DLQ_REPLAY_REQUIRES_ARTIFACT, 4 admin routes
  • B.0 Helper — centralized resolveTemplate(), pure refactor. Internal only
  • B.1 Labels — template label CRUD + If-Match concurrency. New surface: 4 routes under /templates/:key/labels
  • B.2 Testing — @pulp-engine/template-testing + CLI test/promote commands. New surface: pulp-engine test, pulp-engine promote, X-PulpEngine-Test-Run-Id header
  • B.3 Variants — TemplateLabel.trafficSplit + controlLabel, bucketing pass. New surface: VARIANTS_ENABLED, bucketKey body field, x-pulp-engine-bucket header, 3 response headers
  • B.1a Editor — LabelsPanel, version chips, scope-aware UI. New surface: scope field on EditorTokenResponse, getStoredScope() helper

Phase A — Observability & Reliability

A.1 — Per-template render metrics (#3)

The three render histograms (render_phase_duration_seconds, render_output_size_bytes, render_page_count) and render_requests_total gained a bounded templateKey label. Cardinality stays ≤ METRICS_TEMPLATE_LABEL_MAX + 1 regardless of how many templates a deployment has, because unknown / evicted keys resolve to a single __other__ bucket.

Three modes:

  • topn (default) — LFU top-N tracker with 50-key budget. Tracks {count, lastSeenMs} per key and evicts the lowest-score entry when the budget is exceeded. Automatic, but a burst of 50+ distinct templates in one window can churn the tracker.
  • allowlist — explicit comma-separated list in METRICS_TEMPLATE_LABEL_ALLOWLIST. Predictable for deployments with a stable template set; documented as the “production” option.
  • off — label is emitted but always __other__. Preserves query compatibility so sum without (templateKey) keeps working regardless of mode.

Implementation lives in metrics-template-label.ts behind resolveTemplateLabel(key), called from render-accounting.ts before every .observe(...). Unit coverage in metrics-template-label.test.ts asserts eviction order and allowlist passthrough; the integration verification renders 60 distinct keys with MAX=50 and asserts the scraped /metrics output has ≤ 51 distinct values.
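
The mode logic above can be sketched as follows. This is a simplified model: makeResolver and its option names are hypothetical, and the real tracker in metrics-template-label.ts also evicts by LFU score, which this sketch replaces with a plain admit-until-full budget.

```typescript
// Sketch of a cardinality-bounded template label resolver.
// Hypothetical shapes; simplified relative to the real LFU tracker.
type LabelMode = "topn" | "allowlist" | "off";

const OTHER = "__other__";

export function makeResolver(opts: {
  mode: LabelMode;
  max?: number;          // METRICS_TEMPLATE_LABEL_MAX (default 50)
  allowlist?: string[];  // METRICS_TEMPLATE_LABEL_ALLOWLIST
}) {
  const max = opts.max ?? 50;
  // Tracks {count, lastSeenMs} per key, as the release notes describe.
  const seen = new Map<string, { count: number; lastSeenMs: number }>();

  return function resolveTemplateLabel(key: string): string {
    if (opts.mode === "off") return OTHER;
    if (opts.mode === "allowlist") {
      return opts.allowlist?.includes(key) ? key : OTHER;
    }
    // topn (simplified): admit keys until the budget is full.
    const entry = seen.get(key);
    if (entry) {
      entry.count += 1;
      entry.lastSeenMs = Date.now();
      return key;
    }
    if (seen.size < max) {
      seen.set(key, { count: 1, lastSeenMs: Date.now() });
      return key;
    }
    // Over budget: unknown keys fold into the __other__ bucket, so the
    // emitted label set stays ≤ max + 1 distinct values.
    return OTHER;
  };
}
```

Note that in every mode the label is still emitted, so sum without (templateKey) queries behave identically regardless of configuration.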

Rollback: flip METRICS_TEMPLATE_LABEL_MODE=off. Dashboards using sum without (templateKey) keep working either way.

A.2 — Request prioritization (#8)

The PDF renderer’s single pageSemaphore was replaced with a two-semaphore split in browser-pool.ts:

TOTAL = MAX_CONCURRENT_PAGES
PREVIEW_RESERVED = RENDER_PREVIEW_RESERVED_SLOTS  (default 1)
BATCH_MAX = TOTAL - PREVIEW_RESERVED

Preview acquisitions first try the preview semaphore; if exhausted they fall back to the batch semaphore (previews can burst when batch is idle). Batch acquisitions only ever see the batch semaphore. The reserved lane guarantees previews never wait behind an N-item production batch — editor save-to-preview stays snappy during a 500-document nightly run.

acquirePageSlot() now takes an optional priority: 'preview' | 'batch' parameter; default 'batch' keeps every existing caller safe. The preview render routes (/render/preview/*) pass priority: 'preview'; production render routes pass priority: 'batch'. A priority label (cardinality 2) was added to the renderQueueDepth gauge.
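
The acquisition order can be sketched like this. The Semaphore class and slot counts here are illustrative stand-ins, not the browser-pool.ts implementation:

```typescript
// Minimal promise-based semaphore (illustrative, not the real pool's).
class Semaphore {
  private queue: Array<() => void> = [];
  constructor(private slots: number) {}
  tryAcquire(): boolean {
    if (this.slots > 0) { this.slots -= 1; return true; }
    return false;
  }
  acquire(): Promise<void> {
    if (this.tryAcquire()) return Promise.resolve();
    return new Promise((resolve) => this.queue.push(resolve));
  }
  release(): void {
    const next = this.queue.shift();
    if (next) next();          // hand the slot straight to a waiter
    else this.slots += 1;
  }
}

const TOTAL = 4;                            // MAX_CONCURRENT_PAGES
const PREVIEW_RESERVED = 1;                 // RENDER_PREVIEW_RESERVED_SLOTS
const previewSem = new Semaphore(PREVIEW_RESERVED);
const batchSem = new Semaphore(TOTAL - PREVIEW_RESERVED);

export async function acquirePageSlot(
  priority: "preview" | "batch" = "batch",  // default keeps old callers safe
): Promise<() => void> {
  if (priority === "preview") {
    if (previewSem.tryAcquire()) return () => previewSem.release();
    // Preview lane busy: burst into the batch lane if it has idle capacity…
    if (batchSem.tryAcquire()) return () => batchSem.release();
    // …otherwise wait on the reserved lane, never behind the batch queue.
    await previewSem.acquire();
    return () => previewSem.release();
  }
  // Batch acquisitions only ever see the batch semaphore.
  await batchSem.acquire();
  return () => batchSem.release();
}
```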

Starvation coverage lives in the package’s own unit tests, not /metrics scraping — see concurrency-cap.test.ts and concurrency-cap-disabled.test.ts. The unit tests fill the batch lane with N concurrent fake renders and assert a preview acquire during saturation completes in ≤ one batch-slot release, not after the full batch queue drains.

Known limitation (A.2b): the renderQueueDepth gauge is only updated in the in-process render path at routes/render/render.ts:146, which is not the default dispatcher mode. Wiring the gauge into the child-process, container, and socket dispatcher modes is tracked as a follow-up to this release. The starvation invariant the A.2 feature actually protects is covered by the in-package unit tests regardless of dispatcher mode.

Rollback: RENDER_PREVIEW_RESERVED_SLOTS=0 collapses the behavior back to today’s single-semaphore path exactly.

A.3 — Schedule-delivery DLQ (#6, Postgres-only)

A new ScheduleDeliveryDlqEntry Prisma model captures failed schedule-delivery targets so operators can inspect and replay them without digging through logs. Per-target granularity (not per-execution) so a partially-failed execution produces multiple rows; a status filter at the execution level would conflate targets and force JSON digging into delivery payloads.

Critical: references only, never secrets. A DLQ row stores scheduleId, executionId, targetIndex, denormalized targetType, a targetIdHash for dedupe, sanitized lastError, attempts, timestamps, and status. It does not store the target URL, headers, body, or webhook secret. Replay rehydrates the current target config from the live schedule via scheduleStore.getById(scheduleId). If the schedule was edited or deleted between failure and replay:

  • schedule_gone → 409, row transitions to orphaned
  • schedule_mutated → 409, row stays pending so the operator can investigate
  • render_artifact_expired → 409 when ScheduleExecution.renderOutputRef is gone

All three use 409 Conflict (not 404) because the row still exists at the requested URL — the conflict is that current state can’t satisfy the replay.

Why references instead of a denormalized copy: maskScheduleSecrets in schedules.routes.ts already exists because webhook secrets are sensitive and masked on every response. A denormalized DLQ row would create a second durable store of the same secret and a second admin response surface to redact. References avoid stale config too: if an operator fixes a bad webhook URL by editing the schedule, replay automatically uses the fix.

New admin routes (admin-scope gated, mounted under /admin/schedule-dlq):

  • GET /admin/schedule-dlq — list, filter by status / scheduleId, paginated
  • GET /admin/schedule-dlq/:id — single row
  • POST /admin/schedule-dlq/:id/replay — rehydrate + single-shot dispatch (no retry loop — one shot)
  • POST /admin/schedule-dlq/:id/abandon — terminal state + audit event with operator-supplied reason

Dispatcher wiring is minimal: dispatcher.ts calls an injected onExhausted({scheduleId, executionId, targetIndex, targetType, lastError}) callback on the exhausted-retries branch, keeping the dispatcher storage-agnostic. The schedule engine in schedule-engine.ts iterates schedule.deliveryTargets with a known index and supplies the callback, guarded by if (scheduleStore && dlqEnabled).
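
The callback seam can be sketched as follows. Shapes are hypothetical, and the decision to swallow a failed DLQ write is an assumption of this sketch, not something the notes above specify.

```typescript
// Sketch of the storage-agnostic dispatcher hook. Hypothetical shapes.
interface ExhaustedEvent {
  scheduleId: string;
  executionId: string;
  targetIndex: number;
  targetType: string;
  lastError: string;
}

interface DispatchDeps {
  // Injected by the schedule engine only when scheduleStore && dlqEnabled.
  onExhausted?: (e: ExhaustedEvent) => Promise<void> | void;
}

// Called on the exhausted-retries branch; the dispatcher itself stays
// storage-agnostic and never touches the DLQ store directly.
export async function onRetriesExhausted(
  deps: DispatchDeps,
  e: ExhaustedEvent,
): Promise<void> {
  if (!deps.onExhausted) return;  // DLQ disabled / non-Postgres mode: no-op
  try {
    await deps.onExhausted(e);
  } catch {
    // Assumption: a failed DLQ write should not take down delivery
    // accounting for the remaining targets.
  }
}
```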

Storage backend reality: scheduling is Postgres-only today (file and SQL Server modes return 503 scheduling_not_available from /schedules), so the DLQ is Postgres-only by construction. In non-Postgres modes the DLQ store is null and the admin routes return 503 dlq_not_available, matching the existing scheduling pattern. The store is wired via storage-factory.ts behind the same gate as PostgresScheduleStore.

Retention is folded into the existing schedule-engine purge tick. The helper previously known as maybePurgeExecutions is renamed to maybePurgeRetentionStores and now runs both scheduleExecutionStore.purgeOlderThan(executionCutoff) and scheduleDeliveryDlqStore.purgeOlderThan(dlqCutoff) sequentially with independent cutoffs but a single lastPurgeAt gate — avoiding the naive “two helpers, two timestamps” trap where either purge could starve the other. No new scheduler, no new timer, no server.ts changes.
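
A sketch of that single-gate dual purge, with hypothetical store interfaces and an illustrative cadence:

```typescript
// Sketch of maybePurgeRetentionStores. Hypothetical shapes and cadence.
interface RetentionStore {
  purgeOlderThan(cutoff: Date): Promise<number>;
}

let lastPurgeAt = 0;                        // single gate for BOTH purges
const PURGE_INTERVAL_MS = 60 * 60 * 1000;   // illustrative tick interval

export async function maybePurgeRetentionStores(
  now: Date,
  executionStore: RetentionStore,
  dlqStore: RetentionStore | null,          // null in non-Postgres modes
  executionRetentionDays: number,
  dlqRetentionDays: number,                 // SCHEDULE_DLQ_RETENTION_DAYS
): Promise<boolean> {
  // One timestamp gates both purges, so neither can starve the other —
  // the trap the "two helpers, two timestamps" design would fall into.
  if (now.getTime() - lastPurgeAt < PURGE_INTERVAL_MS) return false;
  lastPurgeAt = now.getTime();

  const day = 24 * 60 * 60 * 1000;
  // Independent cutoffs, run sequentially.
  await executionStore.purgeOlderThan(
    new Date(now.getTime() - executionRetentionDays * day),
  );
  if (dlqStore) {
    await dlqStore.purgeOlderThan(
      new Date(now.getTime() - dlqRetentionDays * day),
    );
  }
  return true;
}
```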

New env:

  • SCHEDULE_DLQ_ENABLED (default true when scheduleStore is Postgres; always no-op otherwise)
  • SCHEDULE_DLQ_RETENTION_DAYS (default 30)
  • SCHEDULE_DLQ_REPLAY_REQUIRES_ARTIFACT (default true)

Migration: additive — new table only. Rollback via feature flag leaves nothing to roll back in non-Postgres modes.

Known limitation (A.3b): async batch webhook deliveries are a separate delivery path from schedule deliveries, have their own accounting, and remain log-only in this release. Follow-up to pick them up once the schedule DLQ shape is validated in production.

Phase B — Template Lifecycle & Quality

B.0 — Centralized template resolution (prerequisite, pure refactor)

Before B.1 could land cleanly, the body.version ? getVersionDefinition(...) : getByKey(...) pattern was duplicated in 14 places — 11 render fan-out sites across routes/render/render.ts, 2 in routes/render/batch-async.ts, and 1 in schedule-engine.ts. Adding label resolution to every call site would require touching all fourteen and hoping no call site got missed.

The fix is a single resolveTemplate() helper that takes { key, version?, label? } and returns either { definition, resolvedVersion, resolvedLabel?, source } or a structured { notFound: true, reason: 'unknown_template' | 'unknown_version' | 'unknown_label' }. Every pre-existing call site was migrated. Error messages were preserved byte-for-byte (tests assert on them), so this shipped as a pure refactor with zero assertion changes — the entire existing render, schedules, and schedule-engine test suite passes unchanged.
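
The helper's contract can be sketched like this; the store interface and the head-resolution details are assumptions of this sketch, not the real storage API.

```typescript
// Sketch of resolveTemplate()'s contract. Hypothetical store shape.
interface TemplateStore {
  getByKey(key: string): { version: string; html: string } | null;
  getVersionDefinition(key: string, version: string): { html: string } | null;
  getLabel(key: string, label: string): { version: string } | null;
}

type Resolution =
  | {
      definition: { html: string };
      resolvedVersion: string;
      resolvedLabel?: string;
      source: "version" | "label" | "head";
    }
  | { notFound: true; reason: "unknown_template" | "unknown_version" | "unknown_label" };

export function resolveTemplate(
  store: TemplateStore,
  sel: { key: string; version?: string; label?: string },
): Resolution {
  if (sel.version) {
    const def = store.getVersionDefinition(sel.key, sel.version);
    return def
      ? { definition: def, resolvedVersion: sel.version, source: "version" }
      : { notFound: true, reason: "unknown_version" };
  }
  if (sel.label) {
    const ptr = store.getLabel(sel.key, sel.label);
    if (!ptr) return { notFound: true, reason: "unknown_label" };
    const def = store.getVersionDefinition(sel.key, ptr.version);
    return def
      ? { definition: def, resolvedVersion: ptr.version, resolvedLabel: sel.label, source: "label" }
      : { notFound: true, reason: "unknown_version" };
  }
  // No selector: fall back to the HEAD-of-mutation pointer.
  const head = store.getByKey(sel.key);
  return head
    ? { definition: { html: head.html }, resolvedVersion: head.version, source: "head" }
    : { notFound: true, reason: "unknown_template" };
}
```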

A narrowly-scoped grep gate at scripts/check-template-resolution.mjs fails CI if any file in the render or schedule dispatch allowlist (routes/render/render.ts, routes/render/batch-async.ts, routes/render/batch-shared.ts, lib/schedule-engine.ts, schedule create/update in schedules.routes.ts) grows a new direct getVersionDefinition call. Version-management endpoints in routes/templates/index.ts, the storage implementations in apps/api/src/storage/, and the instrumentation layer are explicitly allowlisted as legitimate direct callers — the helper is for the render + schedule dispatch path, not a blanket ban.

This turned B.1’s label-resolution wiring from “fourteen-site fan-out and pray” into a single-file change.

B.1 — Template labels (#1, API only)

Templates gained named pointers via a new TemplateLabel Prisma model:

model TemplateLabel {
  id         String   @id @default(cuid())
  templateId String   @map("template_id")
  label      String                          // 'prod' | 'staging' | custom
  version    String                          // → TemplateVersion.version
  trafficSplit Int?   @map("traffic_split")    // B.3
  controlLabel String? @map("control_label")   // B.3
  updatedAt  DateTime @updatedAt @map("updated_at")
  updatedBy  String?  @map("updated_by")
  template   Template @relation(fields: [templateId], references: [id], onDelete: Cascade)

  @@unique([templateId, label])
  @@index([templateId])
  @@map("template_labels")
}

Decision: Template.currentVersion stays the HEAD-of-mutation pointer, NOT aliased to prod. Aliasing would mean every PUT silently promotes to prod — the opposite of what staging exists for. A one-shot data migration seeds prod label = currentVersion for every existing template so legacy render calls (no label, no version) keep working exactly as before.

Equivalent schemas landed in the file and SQL Server backends — migration 004_template_labels.sql for SQL Server, labels-sidecar files for the filesystem store.

Routes (dual-scoped read; admin-only mutation):

  • GET /templates/:key/labels — list
  • GET /templates/:key/labels/:label — resolve to full template definition
  • PUT /templates/:key/labels/:label — create or re-point; honors If-Match: "<current-version>" (412 on mismatch). Missing If-Match upserts unconditionally for automated promotion flows.
  • DELETE /templates/:key/labels/:label — idempotent
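
The If-Match semantics on PUT reduce to a small decision, sketched here with hypothetical shapes (applyLabelPut and its status mapping are illustrative, not the real route handler):

```typescript
// Sketch of the If-Match concurrency check on label PUT.
interface LabelRow {
  label: string;
  version: string;
}

export function applyLabelPut(
  existing: LabelRow | null,
  requested: { label: string; version: string },
  ifMatch: string | undefined,   // If-Match: "<current-version>"
): { status: 200 | 201 | 412; row?: LabelRow } {
  if (ifMatch !== undefined) {
    // Strip the quotes of the entity-tag form and compare against the
    // version the label currently points at.
    const expected = ifMatch.replace(/^"|"$/g, "");
    if (existing?.version !== expected) {
      return { status: 412 };    // re-pointed concurrently
    }
  }
  // Missing If-Match upserts unconditionally (automated promotion flows).
  return {
    status: existing ? 200 : 201,
    row: { label: requested.label, version: requested.version },
  };
}
```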

Render body schemas across all 11 single-render verbs, async batch (batch-async.ts), AND the shared sync-batch schema (batch-shared.ts) now accept label as a field mutually exclusive with version. All three schema files had to be updated — missing batch-shared.ts would have silently rejected label on sync batch endpoints even though single-render verbs accepted it.

Schedule model gained a nullable templateLabel column with the same mutual-exclusion guard enforced at the TypeBox schema boundary in schedules.routes.ts. The schedule engine already calls resolveTemplate after B.0, so it only needed to pass the new field through.

Audit model gained label_set and label_delete operations. Both are written to the structured audit log on every mutation.

B.2 — Template testing framework (#10)

New out-of-process package at packages/template-testing/, driven by the CLI. The server stays stateless about test execution — no pixelmatch / pdf-rasterization / puppeteer in the production image.

Fixture format (<template>.test.yaml):

template: invoice
version: "1.0.4"
label: staging
tests:
  - name: "renders with line items"
    input: { ... }
    expect:
      dryRun: ok
      htmlSnapshot: __snapshots__/basic.html
      pdfSnapshot: __snapshots__/basic.pdf.png
      tolerancePixels: 50

Critical: the harness targets PRODUCTION render routes, not preview. The runner calls POST /render, /render/html, /render/docx, /render/xlsx, /render/csv, /render/pptx — all mounted under the render plugin in server.ts. Preview routes (/render/preview/*) are intentionally avoided because:

  1. Preview route registration is gated by production mode + PREVIEW_ROUTES_ENABLED (see config.ts:225). A hardened production deployment with preview routes disabled returns 404 on every /render/preview/* path, so a customer CI pipeline targeting preview routes would break against any real production instance.
  2. Preview routes accept a full template definition in the body, bypassing the {key, version, label} resolution path — the exact path the test harness is supposed to validate.
  3. They are a different code path from what customers hit at runtime, so snapshot tests there give false confidence.

Each test runs in two passes: (1) dryRun:true to exercise Handlebars expressions cheaply (80× faster per the v0.61.0 Stage 3 numbers), then (2) a real render whose bytes are compared against the snapshot (HTML: normalize + diff; PDF: rasterize via pdf-to-png-converter + pixelmatch with tolerancePixels).
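
The HTML half of that comparison can be sketched as follows; the normalization rules shown are illustrative, and the real harness in @pulp-engine/template-testing may differ.

```typescript
// Sketch of the HTML snapshot comparison step. Illustrative rules only:
// neutralize line endings and inter-tag whitespace before diffing, so
// byte-identical content with cosmetic whitespace changes still matches.
export function normalizeHtml(html: string): string {
  return html
    .replace(/\r\n/g, "\n")   // platform-neutral line endings
    .replace(/>\s+</g, "><")  // collapse whitespace between tags
    .trim();
}

export function htmlSnapshotMatches(rendered: string, snapshot: string): boolean {
  return normalizeHtml(rendered) === normalizeHtml(snapshot);
}
```

The PDF half is pixel-based rather than textual, which is why it carries a tolerancePixels budget instead of an exact-match requirement.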

CLI registered in packages/cli/src/index.ts alongside the existing commands:

  • pulp-engine test <template> [--update-snapshots] [--reporter junit] [--fixture-dir ./tests]
  • pulp-engine promote <template> --label <name> --version <ver>

Staging-promotion gate — CLI/CI only, NOT server-enforced in v1. A client-supplied testReport on a label PUT is trivially forgeable. Rather than build an unverifiable pseudo-gate, v1 keeps enforcement in CI:

  • Label promotion requires admin scope (enforced at auth.plugin.ts:61).
  • Every promotion writes an audit event with actor, from-version, to-version, label, credential scope.
  • An optional X-PulpEngine-Test-Run-Id header is recorded in the audit event but never validated — it’s a forensic breadcrumb so operators can correlate a promotion with a CI run. The audit write is fire-and-forget best-effort: a failure to write the breadcrumb is logged but the promotion still succeeds.

Customers wire pulp-engine test && pulp-engine promote into CI and protect the admin API key so only CI can promote. Same trust model as “CI runs tests before npm publish” — the registry doesn’t verify tests either. Sample workflow at .github/workflows/template-ci.sample.yml showing the pattern: test → promote to staging → manual approval → promote to prod.

Server-side verification (pixelmatch/puppeteer/PDF-raster in the production image) was rejected as operationally expensive. Signed test reports (a service-account key signing a test manifest) are a feasible v2 follow-up once customers ask for it.

Full customer story in docs/template-testing.md.

B.3 — A/B variants (#2)

Variants are an extension of the TemplateLabel model — not a new table. A variant IS a label with two extra columns: trafficSplit (0..100) and controlLabel (fallback label name).

Resolution on a render call with label: 'checkout-test':

  1. Load checkout-test label.
  2. If trafficSplit === null → plain label, use its version.
  3. Hash bucketKey → first 8 hex characters of the sha256 digest as a uint32 → mod 100.
  4. If hash < trafficSplit → use the variant version. Else recurse with label = controlLabel.

Bucket key precedence:

  1. request.body.bucketKey (explicit)
  2. x-pulp-engine-bucket header
  3. Deterministic fallback: sha256(templateKey + ':' + JSON.stringify(data)) — logged as a structured warning, but stable so identical inputs don’t flicker between variant and control.
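
The bucketing pass can be sketched as follows, assuming the digest prefix is read as a big-endian uint32. Helper names are hypothetical; the real code lives in variant-bucketing.ts.

```typescript
import { createHash } from "node:crypto";

// Sketch of the deterministic bucketing pass. Hypothetical helper names.
export function bucketPercent(bucketKey: string): number {
  const digest = createHash("sha256").update(bucketKey).digest();
  // First 4 bytes (8 hex chars) of the digest as a big-endian uint32, mod 100.
  return digest.readUInt32BE(0) % 100;
}

// Fallback key when neither body.bucketKey nor the header is supplied:
// stable for identical inputs, so a request never flickers between
// variant and control.
export function fallbackBucketKey(templateKey: string, data: unknown): string {
  return createHash("sha256")
    .update(templateKey + ":" + JSON.stringify(data))
    .digest("hex");
}

export function pickLabel(
  label: string,
  trafficSplit: number | null,   // 0..100; null means a plain label
  controlLabel: string | null,
  bucketKey: string,
): string {
  if (trafficSplit === null) return label;      // plain label: use its version
  return bucketPercent(bucketKey) < trafficSplit
    ? label                                     // variant version
    : (controlLabel ?? label);                  // one-hop control fallback
}
```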

Response headers for debugging: X-PulpEngine-Resolved-Label, X-PulpEngine-Resolved-Version, X-PulpEngine-Bucket (variant or control). Wired through all render verbs.

Guardrails:

  • Reserved labels prod and staging cannot have trafficSplit — enforced in the PUT handler at labels.ts.
  • controlLabel must exist at PUT time (prevents dangling pointers that would fail at render time).
  • Variant-of-variant chaining is rejected (400) — the bucketing pass only resolves one hop, so chaining would silently fall through to the inner variant’s control label instead of bucketing through both. This was surfaced during the B.1a review and the server was tightened to match the editor’s long-standing assumption (the LabelsPanel dropdown has always filtered variant labels out of the control-label picker).
  • controlLabel cannot equal the label being created (no self-loop).
  • Render-time fallback: if the variant version is deleted after promotion, log and fall through to the control label’s version.

Feature flag: shipped behind VARIANTS_ENABLED=false for first-release internal validation. Flip on after validation.

Implementation in variant-bucketing.ts with determinism tests on 10k keys and uniform-distribution verification (within 2% at a 50% split) in variant-bucketing.test.ts.

Known limitation (symmetric-guard gap): the server rejects the downstream attempt — creating Y with controlLabel=X when X is a variant. It does NOT reject the reverse — converting an existing plain label X into a variant when some other label Y already has controlLabel === X leaves Y dangling. The cheap fix is, on a PUT that adds trafficSplit, walk existing labels and reject if any has controlLabel === request.params.label. Low priority (operators would have to actively try this) but filed for follow-up.

B.1a — Editor label promotion UI

The API side of Phase B was complete, but the editor still had no UI for labels — customers could only promote from the CLI or curl. B.1a closes that gap so the full Phase B customer story is end-to-end.

New LabelsPanel dialog at LabelsPanel.tsx:

  • Lists every label pointer on the current template, sorted prod → staging → alphabetical.
  • Shows variant N% chips with a tooltip explaining the control label.
  • Create / re-point / delete affordances gated on admin scope.
  • Variant form with traffic-split slider and control-label dropdown.
  • Handles 412 (re-pointed concurrently — reload) and 403 (requires admin scope — use CLI) with classified error messages.
  • Mounted from the editor overflow menu in EditorHeader.tsx alongside “History”, gated on isApiTemplate.

Version history label chips: VersionHistoryModal.tsx now fetches labels once per open and decorates each version row with a chip per pointing label (variant chips include the split %). Fetch failures are silent — labels are advisory, a 403 for editor-scope sessions must not break version history.

Session scope in the editor — the key enabler for admin-gated UI:

The scope blocker: shared-key editor tokens always verify as editor (see editor-token.provider.ts), so those users cannot promote labels from the editor regardless of which API key they logged in with. Only named-user admin-role users and OIDC admin-group users get admin-scope editor sessions. This is the correct security posture, not a bug — but the editor needed to know its own scope to hide affordances that would 403 anyway.

Wired through every mint path:

Editor side: setStoredToken takes an optional 5th arg, persisted under pulp-engine.editorScope in sessionStorage. New getStoredScope() returns 'admin' | 'editor' | null. Null means “unknown” (pre-upgrade session) — UI treats it as read-only rather than guessing. All call sites updated: LoginGate.tsx, embed-main.tsx, the OIDC handleOidcCallback in auth.ts.
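
The scope helpers can be sketched with an injected storage object so the logic is testable outside the browser; the real editor reads sessionStorage under the pulp-engine.editorScope key directly.

```typescript
// Sketch of the editor scope persistence helpers. The KVStore injection
// is a testing convenience of this sketch, not the real editor API.
type EditorScope = "admin" | "editor";

interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const SCOPE_KEY = "pulp-engine.editorScope";

export function storeScope(store: KVStore, scope: EditorScope): void {
  store.setItem(SCOPE_KEY, scope);
}

export function getStoredScope(store: KVStore): EditorScope | null {
  const raw = store.getItem(SCOPE_KEY);
  // null means "unknown" (e.g. a pre-upgrade session): the UI treats it
  // as read-only rather than guessing.
  return raw === "admin" || raw === "editor" ? raw : null;
}
```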

Test coverage: 17 new tests at LabelsPanel.test.tsx — list view with sort order and variant chip, admin vs editor vs null scope gating, delete confirm flow (including cancel), plain-label re-point with If-Match, variant re-point preserving trafficSplit + controlLabel, 403/412 classification, create plain-label flow with list reload, create variant flow, client-side missing-control guard, and a regression test for the “tick variant → type reserved label name → form collapses and submits as plain label” edge case. All tests updated for the new setStoredToken(..., scope) signature.

Intentionally NOT done:

  • PublishGateDialog integration — publish creates a version; label promotion is a separate operator act. Bundling them would obscure the “publish doesn’t go to prod” semantic that staging/prod exists to create.
  • Symmetric variant-of-variant guard — filed as a Phase B.3 follow-up above.
  • SDK regeneration — deferred to batch with Phase C Tenant primitive to avoid double-regenerating.
  • Plugin event emission for label_set/label_deleted — requires extending PulpEngineEventMap in @pulp-engine/plugin-api, tracked as a follow-up. Subscribers learn about label changes via the audit trail today.
  • Schedule ref validation for templateVersion/templateLabel on schedule create/PUT/PATCH — still only validates templateKey existence at the route layer. Carried across B.1 → B.2 → B.3 → B.1a.

Migration notes

  • Database migrations are additive. The new template_labels and schedule_delivery_dlq tables are created fresh; no existing tables are modified. A one-shot data migration seeds prod label = currentVersion for every existing template so legacy render calls (no label, no version) keep working exactly as before.
  • Render API is backward-compatible. The label field is optional on every render body schema; existing version and “no selector” call shapes are unchanged.
  • Auth responses are additive. EditorTokenResponse.scope is a new field; existing clients that don’t know about it just ignore it. Editor sessions established before this release won’t have a stored scope — the editor treats null as “unknown” and shows labels read-only, which is correct: the admin mutation requires admin scope regardless of what the UI displays.
  • No breaking env changes. Every new env var has a safe default. SCHEDULE_DLQ_ENABLED is auto-enabled in Postgres mode only; VARIANTS_ENABLED defaults to false; METRICS_TEMPLATE_LABEL_MODE defaults to topn with a 50-key budget.

Rollback

Per-feature rollback:

  • A.1 metrics label — METRICS_TEMPLATE_LABEL_MODE=off: label always emits __other__, dashboards keep working
  • A.2 preview lane — RENDER_PREVIEW_RESERVED_SLOTS=0: collapses to single-semaphore behavior exactly
  • A.3 schedule DLQ — SCHEDULE_DLQ_ENABLED=false: dispatcher skips the onExhausted hook; existing table is inert
  • B.1 labels — delete the label row (no render callers are forced to use labels)
  • B.3 variants — VARIANTS_ENABLED=false: render resolution ignores trafficSplit and uses the variant label’s own version directly
  • B.1a editor — editor falls back to read-only if scope is missing; no data rollback needed

Full rollback of the migration requires dropping template_labels and schedule_delivery_dlq — straightforward since neither is referenced by any pre-existing column.

Known flakiness (carryover from prior releases)

Several apps/api tests that boot a full Fastify app have intermittent Windows timeout failures on cold-cache runs — documented in the prior-release follow-up list and confirmed reproducible on stashed pre-B.1a main. The files observed flaking are auth-scopes.test.ts, audit-events.test.ts, named-users.test.ts, editor-session.test.ts, render-batch.test.ts, render-preview.test.ts, and schedules.test.ts. All pass cleanly when run in isolation or on warm caches. Not regressions — pre-existing Windows-specific startup timing.

Phase A + B targeted tests for this release pass clean in isolation on Windows:

  • template-resolution.test.ts, variant-bucketing.test.ts, metrics-template-label.test.ts, concurrency-cap.test.ts, concurrency-cap-disabled.test.ts: 59/59
  • schedule-dlq.test.ts, template-labels.route.test.ts: 53/53 (includes the new variant-of-variant rejection test)
  • schedule-engine.test.ts, delivery-dispatcher.test.ts, file-template.store.test.ts: 95/95
  • Editor suite full run: 990/990 (includes 17 new B.1a tests)

Publishing note

SDK publishing workflows (publish-sdk-typescript.yml, publish-sdk-python.yml, publish-sdk-dotnet.yml, publish-sdk-go.yml, Docker image publish) will not fire automatically when this tag is pushed — GitHub Actions is disabled on the repo through 2026-05-01 due to a billing issue. Operators should either:

  1. Wait to push the tag until Actions is restored, then push normally and let the workflows fire on the tag push, or
  2. Push the tag now and manually invoke the publish workflows (or their local equivalents) after Actions is restored. Tag immutability still applies — the published artifacts must match the commit at the tag.

v0.65.0 is committed and tagged locally. No push has happened.

Follow-ups filed

  • A.2b — wire renderQueueDepth into child-process/container/socket dispatcher modes
  • A.3b — async batch webhook DLQ (currently log-only)
  • B.3 symmetric guard — reject converting a plain label into a variant when another label already has it as a control
  • SDK regeneration — batch with Phase C Tenant primitive
  • Plugin event emission — label_set / label_deleted on PulpEngineEventMap
  • Schedule ref validation — templateVersion/templateLabel existence on schedule create/PUT/PATCH
  • Signed test reports (B.2 v2) — service-account key signing a test manifest, if customers ask for it