Pulp Engine — HA / Clustering Reference Architecture

Reference architecture for running Pulp Engine across multiple API replicas behind a load balancer. Pairs with deployment-guide.md (single-instance topology) and runbook.md (operational procedures).

This document is enterprise-oriented: it assumes a managed Postgres, an object store (S3 / MinIO / R2), and an HTTPS-terminating load balancer are available. For single-instance or evaluation deployments, use the simpler topologies in the deployment guide.

1. Topology

                      ┌───────────────────────┐
                      │   HTTPS Load Balancer │  (sticky sessions NOT required)
                      │   TLS termination     │
                      └───────────┬───────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                ▼                 ▼                 ▼
           ┌─────────┐       ┌─────────┐       ┌─────────┐
           │ API pod │       │ API pod │  ...  │ API pod │
           │   N=1   │       │   N=2   │       │   N=k   │
           └────┬────┘       └────┬────┘       └────┬────┘
                │                 │                 │
                ├─────────────────┴─────────────────┤
                ▼                                   ▼
        ┌───────────────┐                 ┌─────────────────┐
        │  Postgres     │                 │  Object store   │
        │  (primary +   │                 │  (S3 / MinIO /  │
        │   replicas)   │                 │   R2 — shared)  │
        └───────────────┘                 └─────────────────┘

Key properties:

Request handlers are stateless — no sticky sessions required.
Editor session tokens are HMAC-signed (apps/api/src/lib/editor-token.ts) — no session store. There is no separate EDITOR_TOKEN_SECRET: tokens are signed and verified against the active API-key credentials, which must already be identical across pods (see § 3 and the “Editor token signing” note below).
All durable state lives in Postgres + the object store. Nothing on local pod disk is authoritative.

2. Stateless vs Stateful Components

Stateless (scale horizontally without coordination)

HTTP request handlers
Editor session tokens (5-part HMAC-signed, no storage)
OIDC auth code flow (stateless completion-code delivery)
Capability responses
Template / asset / render routes

Shared state (authoritative — all pods read/write)

Postgres: templates, versions, labels, asset metadata, audit events, schedules + executions + DLQ, tenant registry, render usage. Schema: apps/api/src/prisma/schema.prisma.
Object store: asset binaries (ASSET_BINARY_STORE=s3).

Per-pod state (multi-instance safe — see notes below)

Component	File	Multi-instance behaviour
Schedule dispatcher	`apps/api/src/lib/schedule-engine.ts`	Each pod polls independently; DB row-level claim via `INSERT … ON CONFLICT … SKIP LOCKED` guarantees a given schedule execution fires exactly once across the cluster. No leader election required.
TenantStatusCache	`apps/api/src/lib/tenant-status-cache.ts`	Per-pod TTL cache (default 10 s). Tenant archive operations have a ≤ TTL staleness window before all pods converge. Tune via `TENANT_STATUS_CACHE_TTL_MS`. Acceptable for typical workloads; set to a lower value if you need stricter archive propagation.
Audit-purge scheduler	`apps/api/src/lib/audit-purge-scheduler.ts`	Runs per pod. Idempotent — all pods issue the same `DELETE WHERE timestamp < cutoff`; duplicate work is harmless but wasteful. Consider disabling on all but one pod in very large deployments (operator choice).
Render-usage-purge scheduler	`apps/api/src/lib/render-usage-purge-scheduler.ts`	Same pattern as audit purge — idempotent, safe across pods.
Browser singleton (child-process render mode)	`apps/api/src/server.ts`	Chromium instance warmed per pod. Cannot be shared cross-process. See § 4.
Delivery dispatcher batch job store	`apps/api/src/lib/delivery/dispatcher.ts`	Known limitation: in-flight batch jobs held in-memory are lost if the pod restarts mid-batch. The DLQ is persisted to Postgres; permanent failures are not lost. Treat batch deliveries as best-effort across pod restarts.
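
The schedule dispatcher's claim-then-fire behaviour can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `ExecutionTable`, `tryClaim`, `pollAll`, and the `(scheduleId, tickAt)` key are hypothetical names, and the `Map` stands in for a database table with a unique key. The actual claim logic lives in schedule-engine.ts and may differ.

```typescript
// In-memory stand-in for a table with a unique key on (scheduleId, tickAt).
// Hypothetical model: in the real system each pod issues a SQL claim instead.
class ExecutionTable {
  private rows = new Map<string, string>(); // claim key -> pod that won it

  // Mimics an insert that is a no-op on a unique-key conflict:
  // returns true only for the first caller for a given tick.
  tryClaim(scheduleId: string, tickAt: string, podId: string): boolean {
    const key = `${scheduleId}@${tickAt}`;
    if (this.rows.has(key)) return false; // another pod already claimed it
    this.rows.set(key, podId);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}

// Every pod polls independently and tries to claim each due tick.
function pollAll(
  table: ExecutionTable,
  pods: string[],
  scheduleId: string,
  ticks: string[],
): string[] {
  const fired: string[] = [];
  for (const tickAt of ticks) {
    for (const podId of pods) {
      if (table.tryClaim(scheduleId, tickAt, podId)) fired.push(`${podId}:${tickAt}`);
    }
  }
  return fired;
}

const executions = new ExecutionTable();
const fired = pollAll(executions, ["api1", "api2"], "nightly-report", ["00:00", "00:05", "00:10"]);
console.log(fired.length); // one execution per tick, regardless of pod count
```

Because the uniqueness check decides the winner, the pods never need to know about each other, which is why no leader election is involved.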

Not applicable in HA

File storage modes (STORAGE_MODE=file, ASSET_BINARY_STORE=filesystem) — assume a single writer. Do not run multiple API pods against a shared filesystem; use Postgres + S3 instead.

3. Required Configuration

All pods must share the following values:

Variable	Value	Notes
`STORAGE_MODE`	`postgres` (or `sqlserver`)	File mode is not HA-safe
`DATABASE_URL`	Managed Postgres primary	Point every API pod at the primary; Prisma does not currently split reads
`ASSET_BINARY_STORE`	`s3`	Required — shared-volume NFS mode also works but S3 is the reference
`S3_BUCKET`, `S3_REGION`, `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`	Shared across pods	See deployment-guide.md § Object Storage
`API_KEY_ADMIN`, `API_KEY_EDITOR`, `API_KEY_RENDER`, `API_KEY_PREVIEW`	Identical across pods	`API_KEY_ADMIN`/`API_KEY_EDITOR` also sign editor tokens — a token minted by pod A must verify on pod B, so these must match (see “Editor token signing” below)
`TRUST_PROXY`	`true`	LB terminates TLS; real client IP in `X-Forwarded-For`
`REQUIRE_HTTPS`	`true`	Enforce the LB redirect contract
`TENANT_STATUS_CACHE_TTL_MS`	`10000` (default) or lower	See staleness note in § 2
`APP_VERSION`	Same across pods	Prevents mixed-version surprises in `/readyz` and capability responses
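
The table above can be expressed as a pre-flight check that runs before a pod joins the pool. The sketch below is illustrative only: `checkHaConfig`, `findDrift`, and `MUST_MATCH` are hypothetical helpers, not part of Pulp Engine, and the sample values are made up.

```typescript
// Hypothetical pre-flight check derived from the required-configuration table.
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the pod
// satisfies the HA settings this sketch knows about.
function checkHaConfig(env: Env): string[] {
  const problems: string[] = [];
  const mode = env["STORAGE_MODE"];
  if (mode !== "postgres" && mode !== "sqlserver") {
    problems.push("STORAGE_MODE must be postgres or sqlserver (file mode is not HA-safe)");
  }
  if (mode === "postgres" && !env["DATABASE_URL"]) {
    problems.push("DATABASE_URL is not set");
  }
  if (env["ASSET_BINARY_STORE"] !== "s3") problems.push("ASSET_BINARY_STORE should be s3");
  if (env["TRUST_PROXY"] !== "true") problems.push("TRUST_PROXY should be true behind the LB");
  if (env["REQUIRE_HTTPS"] !== "true") problems.push("REQUIRE_HTTPS should be true");
  return problems;
}

// Variables the table says must be identical on every pod.
const MUST_MATCH = ["API_KEY_ADMIN", "API_KEY_EDITOR", "API_KEY_RENDER", "API_KEY_PREVIEW", "APP_VERSION"];

// Returns the names of variables whose values differ between pods.
function findDrift(pods: Env[]): string[] {
  return MUST_MATCH.filter((key) => new Set(pods.map((pod) => pod[key])).size > 1);
}

const podA: Env = {
  STORAGE_MODE: "postgres",
  DATABASE_URL: "postgres://db.internal/pulp", // made-up value
  ASSET_BINARY_STORE: "s3",
  TRUST_PROXY: "true",
  REQUIRE_HTTPS: "true",
  API_KEY_EDITOR: "k1",
  APP_VERSION: "1.0.0",
};
const podB: Env = { ...podA, API_KEY_EDITOR: "k2" };

console.log(checkHaConfig(podA));     // []
console.log(findDrift([podA, podB])); // ["API_KEY_EDITOR"]
```

Drift in the API keys is the case worth catching early, since it surfaces later as editor tokens that verify on one pod and fail on another.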

Editor token signing

Editor session tokens are HMAC-signed and verified against the cluster’s active credential set, not a dedicated secret. The candidate secrets (see editorCapableSecrets in auth.plugin.ts) are:

API_KEY_EDITOR and API_KEY_ADMIN (the primary signers),
the legacy API_KEY (when running in legacy single-key mode),
API_KEY_SUPER_ADMIN,
any API_KEYS_JSON / API_KEYS_JSON_FILE entry with admin or editor scope,
the verify-only rollover keys API_KEY_EDITOR_PREVIOUS / API_KEY_ADMIN_PREVIOUS,
in OIDC-only deployments (no API keys), a secret derived from OIDC_COOKIE_SECRET.

Operational consequences:

These must be identical across pods so a token minted on one verifies on another.
Rotating a signing key without a verify-only fallback invalidates every in-flight editor token immediately (existing tokens, not just new mints). For a graceful rollover, set the new key and carry the old value in API_KEY_*_PREVIOUS (verify-only) until outstanding tokens expire, then remove it.
There is no EDITOR_TOKEN_SECRET — if you have set it in an env file or compose, it is inert and can be removed.
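
Verification against a set of candidate secrets, including verify-only rollover keys, can be sketched as follows. The token shape here is an assumption made for brevity: a hypothetical two-part `<payload>.<signature>` string, not the real five-part editor token defined in editor-token.ts. The `sign` and `verify` helpers and the key values are likewise illustrative.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical two-part token "<payload>.<hex signature>".
// The real editor token format has five parts and is not reproduced here.
function sign(payload: string, secret: string): string {
  const sig = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

// Accept the token if ANY candidate secret reproduces its signature.
// Verify-only keys simply appear in `candidates` without ever being
// passed to sign().
function verify(token: string, candidates: string[]): boolean {
  const dot = token.lastIndexOf(".");
  if (dot < 0) return false;
  const payload = token.slice(0, dot);
  const given = Buffer.from(token.slice(dot + 1), "hex");
  return candidates.some((secret) => {
    const expected = createHmac("sha256", secret).update(payload).digest();
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}

const oldKey = "editor-key-v1"; // made-up values
const newKey = "editor-key-v2";
const minted = sign("user=alice;exp=1700000000", oldKey); // minted before rotation

console.log(verify(minted, [newKey]));         // false: rotated with no fallback
console.log(verify(minted, [newKey, oldKey])); // true: old key kept as verify-only
```

The two calls at the end mirror the rollover advice: dropping the old key rejects outstanding tokens at once, while keeping it in the candidate list lets them expire naturally.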

Rollout and rotation

API key rotation under HA: use the documented API_KEY_*_PREVIOUS verify-only rollover variables. Roll the new key out to all pods with the old value carried in the matching *_PREVIOUS variable, so the old key keeps verifying throughout the rollout; then swap clients over to the new key, then remove *_PREVIOUS.
Editor token invalidation: EDITOR_TOKEN_ISSUED_AFTER is a shared cutover. Set it to the same timestamp on all pods; tokens issued before that timestamp are then rejected cluster-wide on the next request.

4. Render Isolation in HA

The rendering layer has three modes (child-process, container, socket). For HA:

`RENDER_MODE`	Recommended for HA?	Notes
`child-process` (default)	✅ per-pod	Each pod warms its own Chromium. Safe and simple.
`container`	✅	Each pod spawns a render container per request. Requires Docker socket; use cautiously (privileged).
`socket`	✅ (most isolated)	API pod has no Docker socket; a dedicated controller pod does. Best privilege separation.

Recommendation: start with child-process mode unless you have a specific privilege-separation requirement. Scale the API pods horizontally; render capacity scales with pod count.

5. Known Limitations

Batch delivery jobs are in-memory per pod — pod restart mid-batch loses in-flight job state (DLQ still captures permanent failures).
Audit and render-usage purge schedulers run per pod, so the duplicate work is harmless. If it shows up in DB load metrics, operators may want to disable the schedulers on all but one pod; the env-var gating for this is not currently exposed (follow-up).
TenantStatusCache staleness window (default 10 s) — archive-a-tenant propagation is eventually consistent within TTL.
No read replicas: Prisma is configured against a single DATABASE_URL. Under very high read load, scale Postgres vertically or put a read-replica-aware proxy in front of the database (a query-routing layer such as Pgpool-II; PgBouncer on its own pools connections but does not route reads). The app does not partition reads itself.
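
The staleness window of a per-pod TTL cache is easiest to see with an injectable clock. The sketch below is a generic TTL cache, not the TenantStatusCache implementation; `TtlCache`, the loader, and the tenant values are hypothetical.

```typescript
// Minimal per-pod TTL cache with an injectable clock (hypothetical sketch).
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh; otherwise reloads it.
  get(key: string, load: (key: string) => V): V {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // may be stale
    const value = load(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

let clock = 0; // fake time in milliseconds
const tenants = new Map<string, string>([["acme", "active"]]); // shared source of truth
const loadTenant = (id: string): string => tenants.get(id) ?? "unknown";
const podBCache = new TtlCache<string>(10_000, () => clock);

const before = podBCache.get("acme", loadTenant); // "active", now cached on pod B
tenants.set("acme", "archived");                  // pod A archives the tenant
clock = 9_999;
const during = podBCache.get("acme", loadTenant); // still "active": inside the TTL
clock = 10_000;
const after = podBCache.get("acme", loadTenant);  // "archived": entry expired
console.log(before, during, after);
```

Lowering the TTL shrinks the window in which pod B still treats the tenant as active, at the cost of more reads against the shared store.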

6. Reference Compose

A reference docker-compose.ha.yml is provided at the repo root.

This is a demo / evaluation stack, not a production reference. It exists to make the validation exercise in § 7 reproducible on a single host and to show the wiring. For production:

Replace MinIO with managed S3.
Replace the Postgres container with a managed Postgres service (backups, HA, PITR).
Replace the simple LB container with your production ingress (ALB, GCLB, nginx, Traefik, etc.).
Store secrets in your platform’s secret manager, not the compose file.

See docker-compose.ha.yml for the stack and docs/ha-validation-report.md for the validation results.

7. Validation Checklist

Coverage is a mix of automated and manual checks. Items marked automated run in .github/workflows/ha-nightly.yml against a fresh docker-compose.ha.yml stack every night and on workflow_dispatch. Manual items are smoke tests; rerun them after major version upgrades or infrastructure changes. Results are captured in ha-validation-report.md.

1. Shared asset readability (manual). Upload an asset via pod A, then render a template referencing it via pod B. Expect: the same asset bytes are returned in the PDF.
2. Schedule fires exactly once (automated: scripts/ha/check-2-schedule-fires-once.mjs). Configure a cron schedule; start 2+ pods; wait for three ticks. Query /schedules/:id/executions and expect exactly one row per scheduled tick (not one per pod).
3. Editor token cross-pod (manual). Mint an editor token via pod A (POST /auth/editor-token), then submit a template mutation via pod B with that token. Expect: 200 plus an audit row attributed to the minter.
4. Graceful degradation (partially covered by Check 7 below). Kill one pod mid-request; expect the surviving pod to continue serving. The automated Check 7 quantifies this for the outage shape (process down, container present); rolling-replacement semantics with new container IPs are not yet automated and remain a manual smoke test.
5. Tenant archive propagation (manual). In multi-tenant mode, archive a tenant via pod A. Wait TENANT_STATUS_CACHE_TTL_MS. Expect: write attempts via pod B are rejected.
6. Key rotation (automated, editor-key variant: scripts/ha/check-6-api-key-rotation.mjs). Drives a four-stage rotation lifecycle (initial → api1 rotated → both rotated → grace ended) and asserts the _PREVIOUS rollover contract via direct per-replica probes plus a restart-window log scan. The manual playbook in ha-validation-report.md covers admin-key rotation; the automation exercises the same contract on the editor key.
7. Single-replica outage failover (automated: scripts/ha/check-7-outage-failover.mjs). Sustained load runs against the LB while each replica is stopped and started in turn (docker compose stop, then docker compose start). Asserts ≥99% success rate across the full window, which exercises nginx’s proxy_next_upstream retry behavior on connection-refused. Scope note: this exercises the “process down, container present” outage shape (stable container identity, IP, hostname). It does NOT exercise rolling-replacement semantics where each replica is recreated with --force-recreate and gets a new IP. The demo nginx config resolves the api1/api2 hostnames once at startup, so recreate semantics need a real ingress (k8s Service, ALB, or nginx with a resolver directive). That gap remains a manual smoke test.
8. Redis-backed rate-limit shared bucket (automated: scripts/ha/check-8-rate-limit-shared.mjs). The base docker-compose.ha.yml already ships a Redis service and sets RATE_LIMIT_STORE=redis on both replicas (shared rate limiting is the default HA posture); the CI overlay docker-compose.ha.redis.ci.yml only tightens RATE_LIMIT_MAX and exposes direct per-replica ports for the probe. The check exhausts the rate-limit bucket on api1 (5 priming requests + 1 sanity 429), then sends one request to api2 with the same client identity and asserts 429, proving the bucket is genuinely shared via Redis rather than per-instance. A fresh-client probe on api2 (different X-Forwarded-For) returns 200, ruling out a “throttle everything” pathological mode. The bucket key derivation (${req.ip}:${routeClass} in single-tenant mode, prefixed with the @fastify/rate-limit pulp-engine-rl- namespace) is documented in lockstep in the driver header and the overlay file.
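
The logic of the Redis-backed rate-limit check can be modelled without a network. In the sketch below, `CounterStore` stands in for Redis and `Replica` for an API instance; both are hypothetical, as are the `render` route class and the client IPs. Only the key shape and the 5-plus-1 priming pattern are taken from the description of that check.

```typescript
// Counter store shared by all replicas; stands in for Redis in this sketch.
class CounterStore {
  private counts = new Map<string, number>();

  incr(key: string): number {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// One API replica. The key follows the single-tenant derivation quoted in
// the rate-limit check: namespace prefix, then client IP and route class.
class Replica {
  private store: CounterStore;
  private max: number;

  constructor(store: CounterStore, max: number) {
    this.store = store;
    this.max = max;
  }

  // Returns the HTTP status the replica would answer with.
  handle(ip: string, routeClass: string): number {
    const key = `pulp-engine-rl-${ip}:${routeClass}`;
    return this.store.incr(key) > this.max ? 429 : 200;
  }
}

const sharedStore = new CounterStore();
const api1 = new Replica(sharedStore, 5);
const api2 = new Replica(sharedStore, 5);

// Five priming requests succeed, the sixth is the sanity 429.
const priming: number[] = [];
for (let i = 0; i < 6; i++) priming.push(api1.handle("203.0.113.7", "render"));

const sameClientOnApi2 = api2.handle("203.0.113.7", "render");  // 429: bucket is shared
const freshClientOnApi2 = api2.handle("203.0.113.8", "render"); // 200: other clients unaffected

// Contrast: a replica with its own store would wrongly admit the request.
const isolated = new Replica(new CounterStore(), 5);
const sameClientIsolated = isolated.handle("203.0.113.7", "render"); // 200

console.log(priming, sameClientOnApi2, freshClientOnApi2, sameClientIsolated);
```

The isolated replica at the end shows the failure mode the check is designed to detect: with per-instance counters, a client can multiply its budget by the number of replicas.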

Editor user-registry consistency (v0.81.0: DB-backed, shared across replicas)

In STORAGE_MODE=postgres/sqlserver the named-user registry is DB-backed and shared across replicas (editor_users table). Each API instance runs an in-memory cache over the table with async read-through on a miss + a periodic full reload (EDITOR_USERS_CACHE_TTL_MS, default 10 s):

New / OIDC auto-provisioned users are visible on every replica immediately — a cache miss reads through to the shared table. OIDC provision races are resolved by the table’s unique constraints (the same subject reconciles; an id/key collision regenerates).
Role changes, tokenIssuedAfter revocations, and deletes propagate to other replicas within the cache TTL (the originating replica is immediate). This bounded staleness is the only cross-replica lag; keep EDITOR_USERS_CACHE_TTL_MS low for revocation-sensitive deployments.
EDITOR_USERS_JSON/EDITOR_USERS_FILE seed an empty table on first boot (under a seed-only-when-empty guard, so a deleted user is never resurrected); after that the DB is authoritative. Set EDITOR_USERS_DB=true to enable DB-backed named-user mode with no JSON/FILE seed.
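
The difference between immediate visibility of new users and bounded staleness for changes follows from combining read-through on a miss with a periodic reload. A minimal sketch of that combination, with hypothetical names (`UserCache`, `EditorUser`, the `viewer` role) and a `Map` standing in for the editor_users table:

```typescript
interface EditorUser {
  id: string;
  role: string;
}

// Per-replica cache over a shared table (hypothetical sketch).
class UserCache {
  private byId = new Map<string, EditorUser>();
  private table: Map<string, EditorUser>;

  constructor(table: Map<string, EditorUser>) {
    this.table = table;
  }

  // Read-through: a miss consults the shared table, so users created on
  // another replica are visible on first lookup.
  get(id: string): EditorUser | undefined {
    const cached = this.byId.get(id);
    if (cached) return cached; // may lag behind changes until reload()
    const row = this.table.get(id);
    if (!row) return undefined;
    this.byId.set(id, { ...row });
    return { ...row };
  }

  // Full reload; in the real system this runs periodically, governed by
  // EDITOR_USERS_CACHE_TTL_MS. Deleted rows drop out here as well.
  reload(): void {
    const fresh = new Map<string, EditorUser>();
    this.table.forEach((user, id) => fresh.set(id, { ...user }));
    this.byId = fresh;
  }
}

const usersTable = new Map<string, EditorUser>(); // stands in for editor_users
const replicaB = new UserCache(usersTable);

usersTable.set("u1", { id: "u1", role: "editor" }); // provisioned via replica A
const seenAtOnce = replicaB.get("u1")?.role;         // "editor": miss reads through

usersTable.set("u1", { id: "u1", role: "viewer" }); // role change via replica A
const beforeReload = replicaB.get("u1")?.role;       // still "editor" until reload
replicaB.reload();
const afterReload = replicaB.get("u1")?.role;        // "viewer"

usersTable.delete("u1");                             // delete via replica A
replicaB.reload();
const afterDelete = replicaB.get("u1");              // undefined
console.log(seenAtOnce, beforeReload, afterReload, afterDelete);
```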

File mode remains single-instance (flat-file registry), so the cross-replica concern does not apply there.

Still deferred to a separate follow-up batch:

Rolling-replacement (--force-recreate) failover semantics — needs a real ingress layer in the demo stack to be testable.
Multi-instance asset / S3 read-after-write consistency (the manual Check 1 above; Workstream D Phase 2).
