
Pulp Engine — HA / Clustering Reference Architecture

Reference architecture for running Pulp Engine across multiple API replicas behind a load balancer. Pairs with deployment-guide.md (single-instance topology) and runbook.md (operational procedures).

This document is enterprise-oriented: it assumes a managed Postgres, an object store (S3 / MinIO / R2), and an HTTPS-terminating load balancer are available. For single-instance or evaluation deployments, use the simpler topologies in the deployment guide.


1. Topology

                      ┌───────────────────────┐
                      │   HTTPS Load Balancer │  (sticky sessions NOT required)
                      │   TLS termination     │
                      └───────────┬───────────┘

                ┌─────────────────┼─────────────────┐
                ▼                 ▼                 ▼
           ┌─────────┐       ┌─────────┐       ┌─────────┐
           │ API pod │       │ API pod │  ...  │ API pod │
           │   N=1   │       │   N=2   │       │   N=k   │
           └────┬────┘       └────┬────┘       └────┬────┘
                │                 │                 │
                ├─────────────────┴─────────────────┤
                ▼                                   ▼
        ┌───────────────┐                 ┌─────────────────┐
        │  Postgres     │                 │  Object store   │
        │  (primary +   │                 │  (S3 / MinIO /  │
        │   replicas)   │                 │   R2 — shared)  │
        └───────────────┘                 └─────────────────┘

Key properties:

  • Request handlers are stateless — no sticky sessions required.
  • Editor session tokens are HMAC-signed (apps/api/src/lib/editor-token.ts) — no session store. There is no separate EDITOR_TOKEN_SECRET: tokens are signed and verified against the active API-key credentials, which must already be identical across pods (see § 3 and the “Editor token signing” note below).
  • All durable state lives in Postgres + the object store. Nothing on local pod disk is authoritative.

2. Stateless vs Stateful Components

Stateless (scale horizontally without coordination)

  • HTTP request handlers
  • Editor session tokens (5-part HMAC-signed, no storage)
  • OIDC auth code flow (stateless completion-code delivery)
  • Capability responses
  • Template / asset / render routes

Shared state (authoritative — all pods read/write)

  • Postgres — templates, versions, labels, assets metadata, audit events, schedules + executions + DLQ, tenant registry, render usage. Schema: apps/api/src/prisma/schema.prisma.
  • Object store — asset binaries (ASSET_BINARY_STORE=s3).

Per-pod state (multi-instance safe — see notes below)

| Component | File | Multi-instance behaviour |
|---|---|---|
| Schedule dispatcher | apps/api/src/lib/schedule-engine.ts | Each pod polls independently; DB row-level claim via INSERT … ON CONFLICT … SKIP LOCKED guarantees a given schedule execution fires exactly once across the cluster. No leader election required. |
| TenantStatusCache | apps/api/src/lib/tenant-status-cache.ts | Per-pod TTL cache (default 10 s). Tenant archive operations have a ≤ TTL staleness window before all pods converge. Tune via TENANT_STATUS_CACHE_TTL_MS. Acceptable for typical workloads; set to a lower value if you need stricter archive propagation. |
| Audit-purge scheduler | apps/api/src/lib/audit-purge-scheduler.ts | Runs per pod. Idempotent: all pods issue the same DELETE WHERE timestamp < cutoff; duplicate work is harmless but wasteful. Consider disabling on all but one pod in very large deployments (operator choice). |
| Render-usage-purge scheduler | apps/api/src/lib/render-usage-purge-scheduler.ts | Same pattern as audit purge: idempotent, safe across pods. |
| Browser singleton (child-process render mode) | apps/api/src/server.ts | Chromium instance warmed per pod. Cannot be shared cross-process. See § 4. |
| Delivery dispatcher batch job store | apps/api/src/lib/delivery/dispatcher.ts | Known limitation: in-flight batch jobs held in-memory are lost if the pod restarts mid-batch. The DLQ is persisted to Postgres; permanent failures are not lost. Treat batch deliveries as best-effort across pod restarts. |
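
The exactly-once claim described for the schedule dispatcher can be illustrated with an in-memory stand-in for a uniquely keyed table. This is a minimal sketch and not the code in schedule-engine.ts: `ExecutionTable`, `tryClaim`, `pollAll`, and the key format are hypothetical, and a Map plays the role that a database unique constraint would play in Postgres.

```typescript
// Hypothetical stand-in for a table with a unique key on (scheduleId, tick).
// In a real deployment the database enforces uniqueness; here a Map does.
class ExecutionTable {
  private rows = new Map<string, string>(); // claim key -> pod that won it

  // Returns true only for the first caller per (scheduleId, tick).
  tryClaim(scheduleId: string, tick: number, podId: string): boolean {
    const key = `${scheduleId}:${tick}`;
    if (this.rows.has(key)) return false; // another pod already claimed it
    this.rows.set(key, podId);
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}

// Every pod polls independently and attempts the same claims.
function pollAll(
  table: ExecutionTable,
  pods: string[],
  scheduleId: string,
  ticks: number[],
): string[] {
  const fired: string[] = [];
  for (const tick of ticks) {
    for (const pod of pods) {
      if (table.tryClaim(scheduleId, tick, pod)) fired.push(`${pod}@${tick}`);
    }
  }
  return fired;
}

const executions = new ExecutionTable();
const fired = pollAll(executions, ["pod-a", "pod-b", "pod-c"], "sched-1", [1, 2, 3]);
console.log(fired.length, executions.size); // 3 3: one execution per tick, not per pod
```

Because the claim is decided by the shared store rather than by the pods, adding or removing pods does not change the number of executions, which is why no leader election is needed.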

Not applicable in HA

  • File storage modes (STORAGE_MODE=file, ASSET_BINARY_STORE=filesystem) — assume a single writer. Do not run multiple API pods against a shared filesystem; use Postgres + S3 instead.

3. Required Configuration

All pods must share the following values:

| Variable | Value | Notes |
|---|---|---|
| STORAGE_MODE | postgres (or sqlserver) | File mode is not HA-safe |
| DATABASE_URL | Managed Postgres primary | Point all API replicas at the primary; Prisma does not currently split reads |
| ASSET_BINARY_STORE | s3 | Required. Shared-volume NFS mode also works, but S3 is the reference |
| S3_BUCKET, S3_REGION, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_ENDPOINT | Shared across pods | See deployment-guide.md § Object Storage |
| API_KEY_ADMIN, API_KEY_EDITOR, API_KEY_RENDER, API_KEY_PREVIEW | Identical across pods | API_KEY_ADMIN/API_KEY_EDITOR also sign editor tokens; a token minted by pod A must verify on pod B, so these must match (see “Editor token signing” below) |
| TRUST_PROXY | true | LB terminates TLS; real client IP in X-Forwarded-For |
| REQUIRE_HTTPS | true | Enforce the LB redirect contract |
| TENANT_STATUS_CACHE_TTL_MS | 10000 (default) or lower | See staleness note in § 2 |
| APP_VERSION | Same across pods | Prevents mixed-version surprises in /readyz and capability responses |
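
A preflight check along the following lines could catch misconfigured pods before they join the pool. It is a sketch only: `checkHaEnv`, `mismatchedVars`, and the sample values are hypothetical and are not part of Pulp Engine; the rules simply restate the table above.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical preflight check: returns one message per violated requirement.
function checkHaEnv(env: Env): string[] {
  const problems: string[] = [];
  if (env.STORAGE_MODE !== "postgres" && env.STORAGE_MODE !== "sqlserver") {
    problems.push("STORAGE_MODE must be postgres or sqlserver");
  }
  if (!env.DATABASE_URL) problems.push("DATABASE_URL is not set");
  if (env.ASSET_BINARY_STORE !== "s3") problems.push("ASSET_BINARY_STORE should be s3");
  for (const name of ["TRUST_PROXY", "REQUIRE_HTTPS"]) {
    if (env[name] !== "true") problems.push(`${name} must be true`);
  }
  return problems;
}

// Variables that must carry the same value on every pod.
const MUST_MATCH = [
  "API_KEY_ADMIN",
  "API_KEY_EDITOR",
  "API_KEY_RENDER",
  "API_KEY_PREVIEW",
  "APP_VERSION",
];

// Lists the must-match variables whose values differ between two pods.
function mismatchedVars(a: Env, b: Env): string[] {
  return MUST_MATCH.filter((name) => a[name] !== b[name]);
}

// Illustrative values only.
const podA: Env = {
  STORAGE_MODE: "postgres",
  DATABASE_URL: "postgres://db.example.internal/pulp",
  ASSET_BINARY_STORE: "s3",
  TRUST_PROXY: "true",
  REQUIRE_HTTPS: "true",
  API_KEY_EDITOR: "example-editor-key",
  APP_VERSION: "1.0.0",
};
const podB: Env = { ...podA, APP_VERSION: "1.0.1" };

console.log(checkHaEnv(podA));           // []
console.log(mismatchedVars(podA, podB)); // ["APP_VERSION"]
```

In practice the comparison would run against hashes of the secret values rather than the secrets themselves, so that the check can be logged safely.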

Editor token signing

Editor session tokens are HMAC-signed and verified against the cluster’s active credential set, not a dedicated secret. The candidate secrets (see editorCapableSecrets in auth.plugin.ts) are:

  • API_KEY_EDITOR and API_KEY_ADMIN (the primary signers),
  • the legacy API_KEY (when running in legacy single-key mode),
  • API_KEY_SUPER_ADMIN,
  • any API_KEYS_JSON / API_KEYS_JSON_FILE entry with admin or editor scope,
  • the verify-only rollover keys API_KEY_EDITOR_PREVIOUS / API_KEY_ADMIN_PREVIOUS,
  • in OIDC-only deployments (no API keys), a secret derived from OIDC_COOKIE_SECRET.

Operational consequences:

  • These must be identical across pods so a token minted on one verifies on another.
  • Rotating a signing key invalidates every in-flight editor token immediately (not just new mints). For a graceful rollover, set the new key, then carry the old value in API_KEY_*_PREVIOUS (verify-only) until outstanding tokens expire, then remove it.
  • There is no EDITOR_TOKEN_SECRET — if you have set it in an env file or compose, it is inert and can be removed.
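
The rollover behaviour above can be sketched with Node's built-in HMAC primitives. This is an illustration under stated assumptions, not the logic in editor-token.ts: the two-part `payload.signature` layout, the `sign` and `verify` helpers, and the key values are hypothetical, and the real 5-part token format is not reproduced.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical token layout: "<payload>.<hex signature>".
function sign(payload: string, secret: string): string {
  const sig = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

// Tries every candidate secret, including verify-only rollover keys, and
// accepts the token if any one of them reproduces the signature.
function verify(token: string, candidates: string[]): boolean {
  const dot = token.lastIndexOf(".");
  if (dot < 0) return false;
  const payload = token.slice(0, dot);
  const given = Buffer.from(token.slice(dot + 1), "hex");
  return candidates.some((secret) => {
    const expected = createHmac("sha256", secret).update(payload).digest();
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}

const oldKey = "editor-key-v1"; // illustrative values only
const newKey = "editor-key-v2";

// Minted on pod A before the rotation.
const token = sign("user=alice;exp=1700000000", oldKey);

const afterHardRotation = verify(token, [newKey]);         // false
const afterGraceRotation = verify(token, [newKey, oldKey]); // true
console.log(afterHardRotation, afterGraceRotation);
```

The second call models a pod that carries the old value as a verify-only candidate: tokens minted before the rotation keep working, while new tokens are signed only with the new key.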

Rollout and rotation

  • API key rotation under HA — use the documented API_KEY_*_PREVIOUS verify-only rollover variables. Set the new key on all pods first, then *_PREVIOUS on all pods, then swap clients over, then remove *_PREVIOUS.
  • Editor token invalidation: EDITOR_TOKEN_ISSUED_AFTER is a shared cutover. Set it to the same timestamp on all pods, and all existing tokens (those issued before the cutover) are rejected cluster-wide on the next request.
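
The reason the cutover has to be identical on every pod can be shown with a small comparison. This is a sketch only: `acceptsToken` is a hypothetical helper, and the ISO-8601 timestamp format is an assumption, since the accepted format of EDITOR_TOKEN_ISSUED_AFTER is not stated here.

```typescript
// Hypothetical check: a token is accepted only if it was issued at or after
// the configured cutover. An unset cutover accepts everything.
function acceptsToken(issuedAtMs: number, issuedAfter: string | undefined): boolean {
  if (!issuedAfter) return true;
  const cutoverMs = Date.parse(issuedAfter); // assumes an ISO-8601 value
  if (Number.isNaN(cutoverMs)) {
    throw new Error(`invalid cutover timestamp: ${issuedAfter}`);
  }
  return issuedAtMs >= cutoverMs;
}

const cutover = "2024-01-01T00:00:00Z";
const oldToken = Date.parse("2023-12-31T23:59:59Z");
const newToken = Date.parse("2024-01-01T00:00:01Z");

// Pods that share the cutover agree on both tokens.
console.log(acceptsToken(oldToken, cutover)); // false
console.log(acceptsToken(newToken, cutover)); // true

// A pod that missed the update still accepts the old token, so the same
// request succeeds or fails depending on where the load balancer sends it.
console.log(acceptsToken(oldToken, undefined)); // true
```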

4. Render Isolation in HA

The rendering layer has three modes (child-process, container, socket). For HA:

| RENDER_MODE | Recommended for HA? | Notes |
|---|---|---|
| child-process (default) | ✅ per-pod | Each pod warms its own Chromium. Safe and simple. |
| container | With caution | Each pod spawns a render container per request. Requires Docker socket; use cautiously (privileged). |
| socket | ✅ (most isolated) | API pod has no Docker socket; a dedicated controller pod does. Best privilege separation. |

Recommendation: start with child-process mode unless you have a specific privilege-separation requirement. Scale the API pods horizontally; render capacity scales with pod count.


5. Known Limitations

  1. Batch delivery jobs are in-memory per pod — pod restart mid-batch loses in-flight job state (DLQ still captures permanent failures).
  2. Audit and render-usage purge schedulers run per pod, which produces harmless duplicate work. If this shows up in DB load metrics, the intended mitigation is to disable the schedulers on all but one pod via env-var gating; that gating is not currently exposed and is a follow-up item.
  3. TenantStatusCache staleness window (default 10 s) — archive-a-tenant propagation is eventually consistent within TTL.
  4. No read replicas — Prisma is configured against a single DATABASE_URL. Under very high read load, scale Postgres vertically or add a read-replica-aware proxy (PgBouncer + per-query routing) in front of the database; the app does not partition reads itself.

6. Reference Compose

A reference docker-compose.ha.yml is provided at the repo root.

This is a demo / evaluation stack, not a production reference. It exists to make the validation exercise in § 7 reproducible on a single host and to show the wiring. For production:

  • Replace MinIO with managed S3.
  • Replace the Postgres container with a managed Postgres service (backups, HA, PITR).
  • Replace the simple LB container with your production ingress (ALB, GCLB, nginx, Traefik, etc.).
  • Store secrets in your platform’s secret manager, not the compose file.

See docker-compose.ha.yml for the stack and docs/ha-validation-report.md for the validation results.


7. Validation Checklist

Mixed automated + manual coverage. Items marked automated run in .github/workflows/ha-nightly.yml against a fresh docker-compose.ha.yml stack every night and on workflow_dispatch. Manual items are smoke tests — rerun after major version upgrades or infrastructure changes. Results captured in ha-validation-report.md.

  1. Shared asset readability (manual). Upload an asset via pod A, render a template referencing it via pod B. Expect: same asset bytes returned in the PDF.
  2. Schedule fires exactly once (automated: scripts/ha/check-2-schedule-fires-once.mjs). Configure a cron schedule; start 2+ pods; wait for three ticks. Query /schedules/:id/executions and expect exactly one row per scheduled tick (not one per pod).
  3. Editor token cross-pod (manual). Mint an editor token via pod A (POST /auth/editor-token), submit a template mutation via pod B with that token. Expect: 200 + audit row attributed to the minter.
  4. Graceful degradation (partially covered by Check 7 below). Kill one pod mid-request; expect surviving pod continues to serve. The automated Check 7 quantifies this for the outage shape (process down, container present); rolling-replacement semantics with new container IPs are not yet automated and remain a manual smoke test.
  5. Tenant archive propagation (manual). In multi-tenant mode, archive a tenant via pod A. Wait TENANT_STATUS_CACHE_TTL_MS. Expect: write attempts via pod B are rejected.
  6. Key rotation (automated, editor-key variant: scripts/ha/check-6-api-key-rotation.mjs). Drives a four-stage rotation lifecycle (initial → api1 rotated → both rotated → grace ended) and asserts the _PREVIOUS rollover contract via direct per-replica probes plus a restart-window log scan. The manual playbook in ha-validation-report.md covers admin-key rotation; the automation exercises the same contract on the editor key.
  7. Single-replica outage failover (automated: scripts/ha/check-7-outage-failover.mjs). Sustained load runs against the LB while each replica is stopped and started in turn with docker compose. The check asserts a ≥99% success rate across the full window, which exercises nginx’s proxy_next_upstream retry behavior on connection-refused. Scope note: this exercises the “process down, container present” outage shape (stable container identity, IP, hostname). It does NOT exercise rolling-replacement semantics where each replica is --force-recreated and gets a new IP: the demo nginx config resolves api1/api2 hostnames once at startup, so recreate semantics need a real ingress (k8s Service, ALB, or nginx with a resolver directive). That gap remains a manual smoke test.
  8. Redis-backed rate-limit shared bucket (automated: scripts/ha/check-8-rate-limit-shared.mjs). The base docker-compose.ha.yml already ships a Redis service and sets RATE_LIMIT_STORE=redis on both replicas (shared rate limiting is the default HA posture); the CI overlay docker-compose.ha.redis.ci.yml only tightens RATE_LIMIT_MAX and exposes direct per-replica ports for the probe. The check exhausts the rate-limit bucket on api1 (5 priming requests + 1 sanity 429), then sends one request to api2 with the same client identity and asserts 429, proving the bucket is genuinely shared via Redis rather than per-instance. A fresh-client probe on api2 (different X-Forwarded-For) returns 200, ruling out a “throttle everything” pathological mode. The bucket key derivation (${req.ip}:${routeClass} in single-tenant mode, prefixed with the @fastify/rate-limit pulp-engine-rl- namespace) is documented in both the driver header and the overlay file, which are kept in lockstep.
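
The logic of Check 8 can be reproduced in miniature with a Map standing in for Redis. This is a sketch under stated assumptions, not the check-8 driver: `makeReplica`, the `render` route class, and the IP addresses are hypothetical, the way the namespace prefix is joined to the key is assumed, and window expiry is omitted for brevity.

```typescript
// A Map stands in for the shared Redis store; counters never expire here.
type CounterStore = Map<string, number>;

// Builds a request handler for one replica. The handler returns the HTTP
// status code the replica would answer with.
function makeReplica(store: CounterStore, max: number) {
  return (ip: string, routeClass: string): number => {
    const key = `pulp-engine-rl-${ip}:${routeClass}`; // assumed concatenation
    const count = (store.get(key) ?? 0) + 1;
    store.set(key, count);
    return count > max ? 429 : 200;
  };
}

// Both replicas share one store, as with RATE_LIMIT_STORE=redis.
const shared: CounterStore = new Map();
const api1 = makeReplica(shared, 5);
const api2 = makeReplica(shared, 5);

const priming = Array.from({ length: 5 }, () => api1("203.0.113.7", "render"));
const sanity = api1("203.0.113.7", "render");       // bucket exhausted on api1
const crossReplica = api2("203.0.113.7", "render"); // same client, other replica
const freshClient = api2("198.51.100.9", "render"); // different client identity

// Contrast: a replica with its own private store never sees api1's counts.
const isolated = makeReplica(new Map(), 5);
const isolatedResult = isolated("203.0.113.7", "render");

console.log(priming, sanity, crossReplica, freshClient, isolatedResult);
// [200, 200, 200, 200, 200] 429 429 200 200
```

The last value is the failure mode the check guards against: with per-instance counters, a client could multiply its effective limit by the number of replicas.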

Editor user-registry consistency (v0.81.0: DB-backed, shared across replicas)

In STORAGE_MODE=postgres/sqlserver the named-user registry is DB-backed and shared across replicas (editor_users table). Each API instance runs an in-memory cache over the table with async read-through on a miss + a periodic full reload (EDITOR_USERS_CACHE_TTL_MS, default 10 s):

  • New / OIDC auto-provisioned users are visible on every replica immediately — a cache miss reads through to the shared table. OIDC provision races are resolved by the table’s unique constraints (the same subject reconciles; an id/key collision regenerates).
  • Role changes, tokenIssuedAfter revocations, and deletes propagate to other replicas within the cache TTL (the originating replica is immediate). This bounded staleness is the only cross-replica lag; keep EDITOR_USERS_CACHE_TTL_MS low for revocation-sensitive deployments.
  • EDITOR_USERS_JSON/EDITOR_USERS_FILE seed an empty table on first boot (under a seed-only-when-empty guard, so a deleted user is never resurrected); after that the DB is authoritative. Set EDITOR_USERS_DB=true to enable DB-backed named-user mode with no JSON/FILE seed.
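
The visibility rules above (immediate for new users, bounded by the TTL for changes) follow from combining read-through on a miss with a periodic full reload. The sketch below illustrates that combination; it is not the Pulp Engine implementation. `UserCache` and `EditorUser` are hypothetical, a Map stands in for the editor_users table, reads are synchronous for brevity where the real cache is described as async, and time is passed in explicitly so the behaviour is deterministic.

```typescript
interface EditorUser {
  id: string;
  role: string;
}

// Hypothetical per-replica cache over a shared table.
class UserCache {
  private cache = new Map<string, EditorUser>();
  private loadedAt: number;
  private table: Map<string, EditorUser>;
  private ttlMs: number;

  constructor(table: Map<string, EditorUser>, ttlMs: number, now: number) {
    this.table = table;
    this.ttlMs = ttlMs;
    this.loadedAt = now;
    this.reload(now);
  }

  // Full reload: picks up role changes and deletes made on other replicas.
  private reload(now: number): void {
    this.cache = new Map(this.table);
    this.loadedAt = now;
  }

  get(id: string, now: number): EditorUser | undefined {
    if (now - this.loadedAt >= this.ttlMs) this.reload(now);
    const hit = this.cache.get(id);
    if (hit) return hit; // may be stale for up to ttlMs
    const row = this.table.get(id); // miss: read through to the shared table
    if (row) this.cache.set(id, row);
    return row;
  }
}

const usersTable = new Map<string, EditorUser>();
const replicaB = new UserCache(usersTable, 10_000, 0);

// t=1s: a user is created via another replica; replica B sees it at once.
usersTable.set("u1", { id: "u1", role: "editor" });
const seenOnCreate = replicaB.get("u1", 1_000)?.role;

// t=2s: the role changes elsewhere; replica B still serves the cached row.
usersTable.set("u1", { id: "u1", role: "viewer" });
const seenBeforeTtl = replicaB.get("u1", 2_000)?.role;

// t=10s: the TTL has elapsed, so the next read reloads and converges.
const seenAfterTtl = replicaB.get("u1", 10_000)?.role;

console.log(seenOnCreate, seenBeforeTtl, seenAfterTtl); // editor editor viewer
```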

File mode remains single-instance (flat-file registry), so the cross-replica concern does not apply there.

Still deferred to a separate follow-up batch:

  • Rolling-replacement (--force-recreate) failover semantics — needs a real ingress layer in the demo stack to be testable.
  • Multi-instance asset / S3 read-after-write consistency (the manual Check 1 above; Workstream D Phase 2).