Pulp Engine — HA / Clustering Reference Architecture
Reference architecture for running Pulp Engine across multiple API replicas behind a load balancer. Pairs with deployment-guide.md (single-instance topology) and runbook.md (operational procedures).
This document is enterprise-oriented: it assumes a managed Postgres, an object store (S3 / MinIO / R2), and an HTTPS-terminating load balancer are available. For single-instance or evaluation deployments, use the simpler topologies in the deployment guide.
1. Topology
                  ┌───────────────────────┐
                  │  HTTPS Load Balancer  │   (sticky sessions NOT required)
                  │    TLS termination    │
                  └───────────┬───────────┘
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
       ┌─────────┐       ┌─────────┐       ┌─────────┐
       │ API pod │       │ API pod │  ...  │ API pod │
       │   N=1   │       │   N=2   │       │   N=k   │
       └────┬────┘       └────┬────┘       └────┬────┘
            │                 │                 │
            ├─────────────────┴─────────────────┤
            ▼                                   ▼
    ┌───────────────┐                  ┌─────────────────┐
    │   Postgres    │                  │  Object store   │
    │  (primary +   │                  │  (S3 / MinIO /  │
    │   replicas)   │                  │   R2, shared)   │
    └───────────────┘                  └─────────────────┘
Key properties:
- Request handlers are stateless — no sticky sessions required.
- Editor session tokens are HMAC-signed (apps/api/src/lib/editor-token.ts), so there is no session store. There is no separate EDITOR_TOKEN_SECRET: tokens are signed and verified against the active API-key credentials, which must already be identical across pods (see § 3 and the “Editor token signing” note below).
- All durable state lives in Postgres + the object store. Nothing on local pod disk is authoritative.
2. Stateless vs Stateful Components
Stateless (scale horizontally without coordination)
- HTTP request handlers
- Editor session tokens (5-part HMAC-signed, no storage)
- OIDC auth code flow (stateless completion-code delivery)
- Capability responses
- Template / asset / render routes
Shared state (authoritative — all pods read/write)
- Postgres: templates, versions, labels, assets metadata, audit events, schedules + executions + DLQ, tenant registry, render usage. Schema: apps/api/src/prisma/schema.prisma.
- Object store: asset binaries (ASSET_BINARY_STORE=s3).
Per-pod state (multi-instance safe — see notes below)
| Component | File | Multi-instance behaviour |
|---|---|---|
| Schedule dispatcher | apps/api/src/lib/schedule-engine.ts | Each pod polls independently; DB row-level claim via INSERT … ON CONFLICT … SKIP LOCKED guarantees a given schedule execution fires exactly once across the cluster. No leader election required. |
| TenantStatusCache | apps/api/src/lib/tenant-status-cache.ts | Per-pod TTL cache (default 10 s). Tenant archive operations have a ≤ TTL staleness window before all pods converge. Tune via TENANT_STATUS_CACHE_TTL_MS. Acceptable for typical workloads; set to a lower value if you need stricter archive propagation. |
| Audit-purge scheduler | apps/api/src/lib/audit-purge-scheduler.ts | Runs per pod. Idempotent — all pods issue the same DELETE WHERE timestamp < cutoff; duplicate work is harmless but wasteful. Consider disabling on all but one pod in very large deployments (operator choice). |
| Render-usage-purge scheduler | apps/api/src/lib/render-usage-purge-scheduler.ts | Same pattern as audit purge — idempotent, safe across pods. |
| Browser singleton (child-process render mode) | apps/api/src/server.ts | Chromium instance warmed per pod. Cannot be shared cross-process. See § 4. |
| Delivery dispatcher batch job store | apps/api/src/lib/delivery/dispatcher.ts | Known limitation: in-flight batch jobs held in-memory are lost if the pod restarts mid-batch. The DLQ is persisted to Postgres; permanent failures are not lost. Treat batch deliveries as best-effort across pod restarts. |
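The exactly-once behaviour described for the schedule dispatcher rests on a uniqueness guarantee in the database rather than on coordination between pods. The sketch below models that idea in memory under stated assumptions; it is not the code in schedule-engine.ts. `ExecutionTable`, `tryClaim`, `pollOnce`, and the `(scheduleId, tickAt)` key are hypothetical stand-ins for a table with a unique constraint.

```typescript
// In-memory model of a table with a unique constraint on (scheduleId, tickAt).
// Hypothetical stand-in for the real Postgres table; all names are illustrative.
class ExecutionTable {
  private readonly rows = new Map<string, string>(); // claim key -> claiming pod

  // Behaves like "insert, do nothing on conflict": returns true only for the
  // first caller that claims a given (scheduleId, tickAt) pair.
  tryClaim(scheduleId: string, tickAt: string, podId: string): boolean {
    const key = `${scheduleId}@${tickAt}`;
    if (this.rows.has(key)) return false;
    this.rows.set(key, podId);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}

// Each pod polls independently and fires only the ticks it managed to claim.
function pollOnce(
  table: ExecutionTable,
  podId: string,
  scheduleId: string,
  dueTicks: string[],
): string[] {
  return dueTicks.filter((tick) => table.tryClaim(scheduleId, tick, podId));
}

const executions = new ExecutionTable();
const ticks = ["2024-01-01T00:00:00Z", "2024-01-01T00:01:00Z", "2024-01-01T00:02:00Z"];

// Pod A sees the first two ticks as due; pod B later sees all three.
const firedByA = pollOnce(executions, "pod-a", "sched-1", ticks.slice(0, 2));
const firedByB = pollOnce(executions, "pod-b", "sched-1", ticks);
```

Pod B fires only the tick that pod A had not already claimed, so the table ends up with one row per tick regardless of how many pods poll.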
Not applicable in HA
- File storage modes (STORAGE_MODE=file, ASSET_BINARY_STORE=filesystem) assume a single writer. Do not run multiple API pods against a shared filesystem; use Postgres + S3 instead.
3. Required Configuration
All pods must share the following values:
| Variable | Value | Notes |
|---|---|---|
| STORAGE_MODE | postgres (or sqlserver) | File mode is not HA-safe |
| DATABASE_URL | Managed Postgres primary | Point every API replica at the primary; Prisma does not currently split reads |
| ASSET_BINARY_STORE | s3 | Required. Shared-volume NFS mode also works, but S3 is the reference |
| S3_BUCKET, S3_REGION, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_ENDPOINT | Shared across pods | See deployment-guide.md § Object Storage |
| API_KEY_ADMIN, API_KEY_EDITOR, API_KEY_RENDER, API_KEY_PREVIEW | Identical across pods | API_KEY_ADMIN / API_KEY_EDITOR also sign editor tokens: a token minted by pod A must verify on pod B, so these must match (see “Editor token signing” below) |
| TRUST_PROXY | true | LB terminates TLS; real client IP in X-Forwarded-For |
| REQUIRE_HTTPS | true | Enforce the LB redirect contract |
| TENANT_STATUS_CACHE_TTL_MS | 10000 (default) or lower | See staleness note in § 2 |
| APP_VERSION | Same across pods | Prevents mixed-version surprises in /readyz and capability responses |
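A pre-flight check along these lines can catch a pod that was started with a non-HA configuration or with values that drift from its peers. This is a minimal sketch, not something Pulp Engine ships: `checkHaEnv`, `sharedConfigFingerprint`, and the `MUST_MATCH` list are hypothetical helpers, and the fingerprint is only meant to be compared between pods, never logged alongside the raw values.

```typescript
import { createHash } from "node:crypto";

type Env = Record<string, string | undefined>;

// Returns human-readable problems; an empty list means the env looks HA-safe.
function checkHaEnv(env: Env): string[] {
  const problems: string[] = [];
  if (env.STORAGE_MODE !== "postgres" && env.STORAGE_MODE !== "sqlserver") {
    problems.push("STORAGE_MODE must be postgres or sqlserver");
  }
  if (!env.DATABASE_URL) problems.push("DATABASE_URL is required");
  if (env.ASSET_BINARY_STORE !== "s3") {
    problems.push("ASSET_BINARY_STORE should be s3 (the reference setup)");
  }
  for (const flag of ["TRUST_PROXY", "REQUIRE_HTTPS"]) {
    if (env[flag] !== "true") problems.push(`${flag} should be true`);
  }
  return problems;
}

// Hypothetical list of variables that must be identical on every pod.
const MUST_MATCH = [
  "API_KEY_ADMIN", "API_KEY_EDITOR", "API_KEY_RENDER", "API_KEY_PREVIEW",
  "S3_BUCKET", "S3_REGION", "S3_ENDPOINT", "APP_VERSION",
];

// Short digest of the shared values; two pods with the same settings
// produce the same fingerprint without exposing the secrets themselves.
function sharedConfigFingerprint(env: Env): string {
  const hash = createHash("sha256");
  for (const key of MUST_MATCH) hash.update(`${key}=${env[key] ?? ""}\n`);
  return hash.digest("hex").slice(0, 16);
}
```

Comparing the fingerprint from each pod (for example in a deploy script) gives a quick signal that a key or APP_VERSION differs somewhere in the cluster.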
Editor token signing
Editor session tokens are HMAC-signed and verified against the cluster’s
active credential set, not a dedicated secret. The candidate secrets
(see editorCapableSecrets in auth.plugin.ts) are:
- API_KEY_EDITOR and API_KEY_ADMIN (the primary signers),
- the legacy API_KEY (when running in legacy single-key mode),
- API_KEY_SUPER_ADMIN,
- any API_KEYS_JSON / API_KEYS_JSON_FILE entry with admin or editor scope,
- the verify-only rollover keys API_KEY_EDITOR_PREVIOUS / API_KEY_ADMIN_PREVIOUS,
- in OIDC-only deployments (no API keys), a secret derived from OIDC_COOKIE_SECRET.
Operational consequences:
- These must be identical across pods so a token minted on one verifies on another.
- Rotating a signing key invalidates every in-flight editor token immediately (not just new mints). For a graceful rollover, set the new key, then carry the old value in API_KEY_*_PREVIOUS (verify-only) until outstanding tokens expire, then remove it.
- There is no EDITOR_TOKEN_SECRET. If you have set it in an env file or compose, it is inert and can be removed.
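The effect of a verify-only rollover key is easiest to see with a small model of "sign with one secret, verify against a candidate set". The sketch below is illustrative only: the `payload.signature` token shape, `mintToken`, `verifyToken`, and the key values are assumptions for the example, not the actual format or functions in editor-token.ts.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Illustrative token shape "payload.signature"; the real token has its own layout.
function mintToken(payload: string, signingSecret: string): string {
  return `${payload}.${sign(payload, signingSecret)}`;
}

// Accepts the token if any candidate secret reproduces its signature.
function verifyToken(token: string, candidateSecrets: string[]): boolean {
  const dot = token.lastIndexOf(".");
  if (dot < 0) return false;
  const payload = token.slice(0, dot);
  const given = Buffer.from(token.slice(dot + 1), "hex");
  return candidateSecrets.some((secret) => {
    const expected = Buffer.from(sign(payload, secret), "hex");
    // timingSafeEqual requires equal lengths, so check that first.
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}

// Placeholder key values for the example.
const oldKey = "old-editor-key";
const newKey = "new-editor-key";

// Token minted on pod A before the rotation.
const editorToken = mintToken("sub=alice;exp=1700000000", oldKey);

// Pod B after rotation, old value carried as a verify-only candidate.
const duringGrace = verifyToken(editorToken, [newKey, oldKey]); // true

// Pod B once the old value has been removed.
const afterGrace = verifyToken(editorToken, [newKey]); // false
```

The same model shows why the candidate set must match on every pod: a pod missing the old value rejects a token that its peers still accept.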
Rollout and rotation
- API key rotation under HA: use the documented API_KEY_*_PREVIOUS verify-only rollover variables. Set the new key on all pods first, then *_PREVIOUS on all pods, then swap clients over, then remove *_PREVIOUS.
- Editor token invalidation: EDITOR_TOKEN_ISSUED_AFTER is a shared cutover. Set it on all pods at the same timestamp and all existing tokens are rejected cluster-wide on the next request.
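The cutover amounts to comparing a token's issue time against one shared cutoff. The sketch below shows that comparison and why the value must match on every pod. It assumes an ISO-8601 cutoff and an issued-at time in epoch milliseconds; neither format is confirmed by this document, and `isRejectedByCutover` is a hypothetical helper.

```typescript
// Assumption: the cutoff is an ISO-8601 timestamp and tokens carry an
// issued-at time in epoch milliseconds. Both formats are illustrative.
function isRejectedByCutover(issuedAtMs: number, issuedAfter: string | undefined): boolean {
  if (!issuedAfter) return false; // no cutover configured
  const cutoffMs = Date.parse(issuedAfter);
  if (Number.isNaN(cutoffMs)) throw new Error(`unparseable cutoff: ${issuedAfter}`);
  return issuedAtMs < cutoffMs;
}

const tokenIssuedAt = Date.parse("2024-05-31T12:00:00Z");

// Both pods share the same cutoff: the token is rejected everywhere.
const podA = isRejectedByCutover(tokenIssuedAt, "2024-06-01T00:00:00Z"); // true
const podB = isRejectedByCutover(tokenIssuedAt, "2024-06-01T00:00:00Z"); // true

// A pod that missed the change still accepts the token.
const podMissed = isRejectedByCutover(tokenIssuedAt, undefined); // false
```

A pod left without the setting keeps honouring old tokens, so requests routed to it bypass the invalidation until its configuration catches up.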
4. Render Isolation in HA
The rendering layer has three modes (child-process, container, socket). For HA:
| RENDER_MODE | Recommended for HA? | Notes |
|---|---|---|
| child-process (default) | ✅ per-pod | Each pod warms its own Chromium. Safe and simple. |
| container | ✅ | Each pod spawns a render container per request. Requires Docker socket; use cautiously (privileged). |
| socket | ✅ (most isolated) | API pod has no Docker socket; a dedicated controller pod does. Best privilege separation. |
Recommendation: start with child-process mode unless you have a specific privilege-separation requirement. Scale the API pods horizontally; render capacity scales with pod count.
5. Known Limitations
- Batch delivery jobs are in-memory per pod — pod restart mid-batch loses in-flight job state (DLQ still captures permanent failures).
- Audit and render-usage purge schedulers run per pod — harmless duplicate work. If this shows up in DB load metrics, operator may disable on all but one pod via env-var gating (not currently exposed — follow-up).
- TenantStatusCache staleness window (default 10 s) — archive-a-tenant propagation is eventually consistent within TTL.
- No read replicas: Prisma is configured against a single DATABASE_URL. Under very high read load, scale Postgres vertically or add a read-replica-aware proxy (PgBouncer + per-query routing) in front of the database; the app does not partition reads itself.
6. Reference Compose
A reference docker-compose.ha.yml is provided at the repo root.
This is a demo / evaluation stack, not a production reference. It exists to make the validation exercise in § 7 reproducible on a single host and to show the wiring. For production:
- Replace MinIO with managed S3.
- Replace the Postgres container with a managed Postgres service (backups, HA, PITR).
- Replace the simple LB container with your production ingress (ALB, GCLB, nginx, Traefik, etc.).
- Store secrets in your platform’s secret manager, not the compose file.
See docker-compose.ha.yml for the stack and docs/ha-validation-report.md for the validation results.
7. Validation Checklist
Mixed automated + manual coverage. Items marked automated run in
.github/workflows/ha-nightly.yml
against a fresh docker-compose.ha.yml stack every night and on
workflow_dispatch. Manual items are smoke tests — rerun after major
version upgrades or infrastructure changes. Results captured in
ha-validation-report.md.
1. Shared asset readability (manual). Upload an asset via pod A, render a template referencing it via pod B. Expect: same asset bytes returned in the PDF.
2. Schedule fires exactly once (automated: scripts/ha/check-2-schedule-fires-once.mjs). Configure a cron schedule; start 2+ pods; wait for three ticks. Query /schedules/:id/executions and expect exactly one row per scheduled tick (not one per pod).
3. Editor token cross-pod (manual). Mint an editor token via pod A (POST /auth/editor-token), submit a template mutation via pod B with that token. Expect: 200 + audit row attributed to the minter.
4. Graceful degradation (partially covered by Check 7 below). Kill one pod mid-request; expect the surviving pod to continue serving. The automated Check 7 quantifies this for the outage shape (process down, container present); rolling-replacement semantics with new container IPs are not yet automated and remain a manual smoke test.
5. Tenant archive propagation (manual). In multi-tenant mode, archive a tenant via pod A. Wait TENANT_STATUS_CACHE_TTL_MS. Expect: write attempts via pod B are rejected.
6. Key rotation (automated, editor-key variant: scripts/ha/check-6-api-key-rotation.mjs). Drives a four-stage rotation lifecycle (initial → api1 rotated → both rotated → grace ended) and asserts the _PREVIOUS rollover contract via direct per-replica probes plus a restart-window log scan. The manual playbook in ha-validation-report.md covers admin-key rotation; the automation exercises the same contract on the editor key.
7. Single-replica outage failover (automated: scripts/ha/check-7-outage-failover.mjs). Sustained load against the LB while each replica is stopped and restarted in turn (docker compose stop, then docker compose start). Asserts ≥99% success rate across the full window, exercising nginx’s proxy_next_upstream retry behavior on connection-refused. Scope note: this exercises the “process down, container present” outage shape (stable container identity, IP, hostname). It does NOT exercise rolling-replacement semantics where each replica is recreated with --force-recreate and gets a new IP: the demo nginx config resolves api1/api2 hostnames once at startup, so recreate semantics need a real ingress (k8s Service, ALB, or nginx with a resolver directive). That gap remains a manual smoke test.
8. Redis-backed rate-limit shared bucket (automated: scripts/ha/check-8-rate-limit-shared.mjs). The base docker-compose.ha.yml already ships a Redis service and sets RATE_LIMIT_STORE=redis on both replicas (shared rate limiting is the default HA posture); the CI overlay docker-compose.ha.redis.ci.yml only tightens RATE_LIMIT_MAX and exposes direct per-replica ports for the probe. The check exhausts the rate-limit bucket on api1 (5 priming requests + 1 sanity 429), then sends one request to api2 with the same client identity and asserts 429, proving the bucket is genuinely shared via Redis rather than per-instance. A fresh-client probe on api2 (different X-Forwarded-For) returns 200, ruling out a “throttle everything” pathological mode. The bucket key derivation (${req.ip}:${routeClass} in single-tenant mode, prefixed with the @fastify/rate-limit pulp-engine-rl- namespace) is documented in the driver header and the overlay file in lockstep.
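The logic behind the shared-bucket check can be modelled without Redis. The sketch below is a simplified simulation, not the check-8 driver: a `Map` stands in for the shared store, the limiter has no window expiry, the namespace prefix is omitted, and `hit`, `bucketKey`, and the `render` route class are hypothetical names chosen for the example.

```typescript
type Store = Map<string, number>;

// Key shape described above for single-tenant mode: `${req.ip}:${routeClass}`.
function bucketKey(ip: string, routeClass: string): string {
  return `${ip}:${routeClass}`;
}

// Minimal counter-based limiter (no expiry); returns the status a replica would send.
function hit(store: Store, max: number, ip: string, routeClass: string): 200 | 429 {
  const key = bucketKey(ip, routeClass);
  const used = (store.get(key) ?? 0) + 1;
  store.set(key, used);
  return used > max ? 429 : 200;
}

const MAX = 5;
const sharedStore: Store = new Map(); // stands in for Redis, visible to both replicas
const api1 = (ip: string) => hit(sharedStore, MAX, ip, "render");
const api2 = (ip: string) => hit(sharedStore, MAX, ip, "render");

const priming = Array.from({ length: MAX }, () => api1("10.0.0.1")); // five 200s
const sanity = api1("10.0.0.1");       // 429 on api1: bucket exhausted
const crossReplica = api2("10.0.0.1"); // 429 on api2: same bucket
const freshClient = api2("10.0.0.2");  // 200: a different client is unaffected

// With a per-instance store, the second replica would start from zero.
const perInstance = hit(new Map(), MAX, "10.0.0.1", "render"); // 200
```

The last line is the failure mode the check is designed to detect: with per-instance buckets, a client could multiply its effective limit by the number of replicas.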
Editor user-registry consistency (v0.81.0: DB-backed, shared across replicas)
In STORAGE_MODE=postgres/sqlserver the named-user registry is DB-backed and shared across
replicas (editor_users table). Each API instance runs an in-memory cache over the table with
async read-through on a miss + a periodic full reload (EDITOR_USERS_CACHE_TTL_MS,
default 10 s):
- New / OIDC auto-provisioned users are visible on every replica immediately: a cache miss reads through to the shared table. OIDC provision races are resolved by the table’s unique constraints (the same subject reconciles; an id/key collision regenerates).
- Role changes, tokenIssuedAfter revocations, and deletes propagate to other replicas within the cache TTL (the originating replica is immediate). This bounded staleness is the only cross-replica lag; keep EDITOR_USERS_CACHE_TTL_MS low for revocation-sensitive deployments.
- EDITOR_USERS_JSON / EDITOR_USERS_FILE seed an empty table on first boot (under a seed-only-when-empty guard, so a deleted user is never resurrected); after that the DB is authoritative. Set EDITOR_USERS_DB=true to enable DB-backed named-user mode with no JSON/FILE seed.
File mode remains single-instance (flat-file registry), so the cross-replica concern does not apply there.
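The two propagation behaviours in DB-backed mode (new users visible at once, changes visible within the TTL) both follow from a read-through cache with a periodic full reload. The sketch below models that pattern synchronously for brevity; the real cache is described as async. `UserCache`, the `Map` standing in for the editor_users table, and the injected clock are all hypothetical.

```typescript
type Role = "admin" | "editor";
interface User { id: string; role: Role }

class UserCache {
  private entries = new Map<string, User>();
  private loadedAt = 0;
  private readonly table: Map<string, User>; // stands in for the shared table
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(table: Map<string, User>, ttlMs: number, now: () => number) {
    this.table = table;
    this.ttlMs = ttlMs;
    this.now = now;
    this.reload();
  }

  // Periodic full reload: replace the cached snapshot with the table contents.
  private reload(): void {
    this.entries = new Map(this.table);
    this.loadedAt = this.now();
  }

  get(id: string): User | undefined {
    if (this.now() - this.loadedAt >= this.ttlMs) this.reload();
    const cached = this.entries.get(id);
    if (cached) return cached;          // may be stale until the next reload
    const fresh = this.table.get(id);   // read-through on a miss
    if (fresh) this.entries.set(id, fresh);
    return fresh;
  }
}

let clock = 0;
const usersTable = new Map<string, User>([["u1", { id: "u1", role: "admin" }]]);
const replicaB = new UserCache(usersTable, 10_000, () => clock);

// A user created on another replica is visible immediately (cache miss).
usersTable.set("u2", { id: "u2", role: "editor" });
const newUserSeen = replicaB.get("u2") !== undefined; // true

// A role change made elsewhere is stale until the TTL elapses.
usersTable.set("u1", { id: "u1", role: "editor" });
clock = 5_000;
const staleRole = replicaB.get("u1")?.role; // "admin"
clock = 10_000;
const freshRole = replicaB.get("u1")?.role; // "editor"
```

The asymmetry is the design trade-off: misses cost a table read but are never stale, while hits are cheap and bounded by the TTL.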
Still deferred to a separate follow-up batch:
- Rolling-replacement (--force-recreate) failover semantics: needs a real ingress layer in the demo stack to be testable.
- Multi-instance asset / S3 read-after-write consistency (the manual Check 1 above; Workstream D Phase 2).