Pulp Engine — HA / Clustering Reference Architecture
Reference architecture for running Pulp Engine across multiple API replicas behind a load balancer. Pairs with deployment-guide.md (single-instance topology) and runbook.md (operational procedures).
This document is enterprise-oriented: it assumes a managed Postgres, an object store (S3 / MinIO / R2), and an HTTPS-terminating load balancer are available. For single-instance or evaluation deployments, use the simpler topologies in the deployment guide.
1. Topology
┌───────────────────────┐
│ HTTPS Load Balancer │ (sticky sessions NOT required)
│ TLS termination │
└───────────┬───────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ API pod │ │ API pod │ ... │ API pod │
│ N=1 │ │ N=2 │ │ N=k │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
├─────────────────┴─────────────────┤
▼ ▼
┌───────────────┐ ┌─────────────────┐
│ Postgres │ │ Object store │
│ (primary + │ │ (S3 / MinIO / │
│ replicas) │ │ R2 — shared) │
└───────────────┘ └─────────────────┘
Key properties:
- Request handlers are stateless — no sticky sessions required.
- Editor session tokens are HMAC-signed (apps/api/src/lib/editor-token.ts) — validated against the shared EDITOR_TOKEN_SECRET; no session store.
- All durable state lives in Postgres + the object store. Nothing on local pod disk is authoritative.
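The "mint on pod A, verify on pod B" property follows directly from HMAC signing with a shared secret. A minimal sketch of the idea — not the actual 5-part scheme in apps/api/src/lib/editor-token.ts; the payload shape and function names here are illustrative assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative sketch only — the real 5-part token format lives in
// apps/api/src/lib/editor-token.ts. Any pod holding EDITOR_TOKEN_SECRET
// can verify a token minted by any other pod; no session store is needed.
function mint(payload: string, secret: string): string {
  const body = Buffer.from(payload).toString("base64url");
  const sig = createHmac("sha256", secret).update(body).digest("base64url");
  return `${body}.${sig}`;
}

function verify(token: string, secret: string): boolean {
  const [body, sig] = token.split(".");
  if (!body || !sig) return false;
  const expected = createHmac("sha256", secret).update(body).digest("base64url");
  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  // Constant-time comparison; lengths must match for timingSafeEqual.
  return a.length === b.length && timingSafeEqual(a, b);
}

// "Pod A" mints; "pod B" verifies with the same shared secret.
const token = mint('{"user":"alice"}', process.env.EDITOR_TOKEN_SECRET ?? "dev-secret");
```

This is also why EDITOR_TOKEN_SECRET must be identical on every pod (§ 3): a token signed under one secret fails verification under any other.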
2. Stateless vs Stateful Components
Stateless (scale horizontally without coordination)
- HTTP request handlers
- Editor session tokens (5-part HMAC-signed, no storage)
- OIDC auth code flow (stateless completion-code delivery)
- Capability responses
- Template / asset / render routes
Shared state (authoritative — all pods read/write)
- Postgres — templates, versions, labels, assets metadata, audit events, schedules + executions + DLQ, tenant registry, render usage. Schema: apps/api/src/prisma/schema.prisma.
- Object store — asset binaries (ASSET_BINARY_STORE=s3).
Per-pod state (multi-instance safe — see notes below)
| Component | File | Multi-instance behaviour |
|---|---|---|
| Schedule dispatcher | apps/api/src/lib/schedule-engine.ts | Each pod polls independently; DB row-level claim via INSERT … ON CONFLICT … SKIP LOCKED guarantees a given schedule execution fires exactly once across the cluster. No leader election required. |
| TenantStatusCache | apps/api/src/lib/tenant-status-cache.ts | Per-pod TTL cache (default 10 s). Tenant archive operations have a ≤ TTL staleness window before all pods converge. Tune via TENANT_STATUS_CACHE_TTL_MS. Acceptable for typical workloads; set to a lower value if you need stricter archive propagation. |
| Audit-purge scheduler | apps/api/src/lib/audit-purge-scheduler.ts | Runs per pod. Idempotent — all pods issue the same DELETE WHERE timestamp < cutoff; duplicate work is harmless but wasteful. Consider disabling on all but one pod in very large deployments (operator choice). |
| Render-usage-purge scheduler | apps/api/src/lib/render-usage-purge-scheduler.ts | Same pattern as audit purge — idempotent, safe across pods. |
| Browser singleton (child-process render mode) | apps/api/src/server.ts | Chromium instance warmed per pod. Cannot be shared cross-process. See § 4. |
| Delivery dispatcher batch job store | apps/api/src/lib/delivery/dispatcher.ts | Known limitation: in-flight batch jobs held in-memory are lost if the pod restarts mid-batch. The DLQ is persisted to Postgres; permanent failures are not lost. Treat batch deliveries as best-effort across pod restarts. |
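The exactly-once property of the schedule dispatcher rests on a uniqueness claim in the database: whichever pod lands the claim row first wins, and every other pod's attempt is a no-op. The in-memory sketch below stands in for the DB's unique index to show the shape of the race — the real atomicity comes from Postgres (apps/api/src/lib/schedule-engine.ts), not from application code; names and types here are illustrative:

```typescript
// In-memory stand-in for the DB's unique index on (scheduleId, tick).
// In the real dispatcher the atomicity is provided by Postgres — conceptually
// an INSERT into the executions table that silently loses to an existing
// claim row — so this Map is illustrative only and NOT safe across processes.
const claimed = new Map<string, string>(); // claim key → winning pod

function tryClaim(scheduleId: string, tick: number, pod: string): boolean {
  const key = `${scheduleId}:${tick}`;
  if (claimed.has(key)) return false; // another pod already owns this tick
  claimed.set(key, pod);
  return true;
}

// Three pods poll the same tick independently; exactly one claim succeeds.
const winners = ["pod-1", "pod-2", "pod-3"].filter((p) =>
  tryClaim("daily-report", 1718000000, p),
);
```

Because the claim is per (schedule, tick), every pod can poll on its own timer with no leader election, and the cluster still fires each tick once.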
Not applicable in HA
- File storage modes (STORAGE_MODE=file, ASSET_BINARY_STORE=filesystem) — assume a single writer. Do not run multiple API pods against a shared filesystem; use Postgres + S3 instead.
3. Required Configuration
All pods must share the following values:
| Variable | Value | Notes |
|---|---|---|
| STORAGE_MODE | postgres (or sqlserver) | File mode is not HA-safe |
| DATABASE_URL | Managed Postgres primary | Point all API replicas at the primary; Prisma does not currently split reads |
| ASSET_BINARY_STORE | s3 | Required — a shared-volume NFS mode also works, but S3 is the reference |
| S3_BUCKET, S3_REGION, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_ENDPOINT | Shared across pods | See deployment-guide.md § Object Storage |
| EDITOR_TOKEN_SECRET | Identical across pods | HMAC key — a token minted by pod A must verify on pod B |
| API_KEY_ADMIN, API_KEY_EDITOR, API_KEY_RENDER, API_KEY_PREVIEW | Identical across pods | |
| TRUST_PROXY | true | The LB terminates TLS; the real client IP arrives in X-Forwarded-For |
| REQUIRE_HTTPS | true | Enforce the LB redirect contract |
| TENANT_STATUS_CACHE_TTL_MS | 10000 (default) or lower | See staleness note in § 2 |
| APP_VERSION | Same across pods | Prevents mixed-version surprises in /readyz and capability responses |
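Config drift between pods is the most common way an HA rollout goes wrong (a token minted on one pod failing on another, for instance). One way to catch it is to fingerprint the must-match subset of each pod's environment and compare the digests, never the values. This helper is hypothetical — not part of Pulp Engine — and the variable list is taken from the table above:

```typescript
import { createHash } from "node:crypto";

// Variables that must be identical on every pod (from the table above).
const SHARED_VARS = [
  "STORAGE_MODE", "ASSET_BINARY_STORE", "EDITOR_TOKEN_SECRET",
  "API_KEY_ADMIN", "API_KEY_EDITOR", "API_KEY_RENDER", "API_KEY_PREVIEW",
  "TRUST_PROXY", "REQUIRE_HTTPS", "APP_VERSION",
];

// Hypothetical drift check: hash the shared subset of an environment so
// operators can compare pods without ever logging a secret value.
function configFingerprint(env: Record<string, string | undefined>): string {
  const h = createHash("sha256");
  for (const k of SHARED_VARS) h.update(`${k}=${env[k] ?? ""}\n`);
  return h.digest("hex").slice(0, 16);
}
```

Two pods with identical shared config produce the same fingerprint; any divergence (a mismatched EDITOR_TOKEN_SECRET, say) shows up immediately as a different digest.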
Rollout and rotation
- API key rotation under HA — use the documented API_KEY_*_PREVIOUS verify-only rollover variables. Set the new key together with *_PREVIOUS (holding the outgoing key) on all pods first, then swap clients over, then remove *_PREVIOUS.
- Editor token invalidation — EDITOR_TOKEN_ISSUED_AFTER is a shared cutover: set it on all pods at the same timestamp and all existing tokens are rejected cluster-wide on the next request.
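The rollover works because, during the rotation window, a presented key is accepted against either the current key or *_PREVIOUS. A hedged sketch of that semantics — the real API's handling may differ, and the helper name is an assumption:

```typescript
import { timingSafeEqual } from "node:crypto";

// Constant-time string comparison (lengths must match for timingSafeEqual).
function eq(a: string, b: string): boolean {
  const x = Buffer.from(a), y = Buffer.from(b);
  return x.length === y.length && timingSafeEqual(x, y);
}

// Sketch of the *_PREVIOUS verify-only rollover: during rotation a request
// key is accepted against the current key OR the previous one, so clients
// can be switched one-by-one with zero 401s. Hypothetical helper.
function isValidAdminKey(
  presented: string,
  env: { API_KEY_ADMIN: string; API_KEY_ADMIN_PREVIOUS?: string },
): boolean {
  if (eq(presented, env.API_KEY_ADMIN)) return true;
  return env.API_KEY_ADMIN_PREVIOUS !== undefined &&
    eq(presented, env.API_KEY_ADMIN_PREVIOUS);
}
```

Once every client presents the new key, removing API_KEY_ADMIN_PREVIOUS closes the window and the old key stops verifying.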
4. Render Isolation in HA
The rendering layer has three modes (child-process, container, socket). For HA:
| RENDER_MODE | Recommended for HA? | Notes |
|---|---|---|
| child-process (default) | ✅ per-pod | Each pod warms its own Chromium. Safe and simple. |
| container | ✅ | Each pod spawns a render container per request. Requires a Docker socket; use cautiously (privileged). |
| socket | ✅ (most isolated) | The API pod has no Docker socket; a dedicated controller pod does. Best privilege separation. |
Recommendation: start with child-process mode unless you have a specific privilege-separation requirement. Scale the API pods horizontally; render capacity scales with pod count.
5. Known Limitations
- Batch delivery jobs are in-memory per pod — pod restart mid-batch loses in-flight job state (DLQ still captures permanent failures).
- Audit and render-usage purge schedulers run per pod — duplicate work is harmless. If this shows up in DB load metrics, an operator may disable the schedulers on all but one pod via env-var gating (not currently exposed — follow-up).
- TenantStatusCache staleness window (default 10 s) — archive-a-tenant propagation is eventually consistent within TTL.
- No read replicas — Prisma is configured against a single DATABASE_URL. Under very high read load, scale Postgres vertically or add a read-replica-aware proxy in front of the database (e.g. Pgpool-II, or PgBouncer combined with application-side routing); the app does not partition reads itself.
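The TenantStatusCache staleness window above is the standard TTL-cache trade-off: a pod keeps serving its cached answer until the entry expires, so an archive performed through another pod takes effect everywhere only after at most one TTL. A minimal sketch of that mechanism — the real implementation is apps/api/src/lib/tenant-status-cache.ts; this class is illustrative, with an injectable clock so the expiry behaviour is observable:

```typescript
// Minimal TTL cache illustrating the TenantStatusCache staleness window.
// A pod that cached status "active" keeps returning it until the entry
// expires (up to TENANT_STATUS_CACHE_TTL_MS), then misses and re-reads
// the authoritative row from Postgres.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(
    private ttlMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e || e.expiresAt <= this.now()) return undefined; // expired → miss
    return e.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Lowering TENANT_STATUS_CACHE_TTL_MS tightens the propagation bound at the cost of more status reads against Postgres.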
6. Reference Compose
A reference docker-compose.ha.yml is provided at the repo root.
This is a demo / evaluation stack, not a production reference. It exists to make the validation exercise in § 7 reproducible on a single host and to show the wiring. For production:
- Replace MinIO with managed S3.
- Replace the Postgres container with a managed Postgres service (backups, HA, PITR).
- Replace the simple LB container with your production ingress (ALB, GCLB, nginx, Traefik, etc.).
- Store secrets in your platform’s secret manager, not the compose file.
See docker-compose.ha.yml for the stack and docs/ha-validation-report.md for the validation results.
7. Validation Checklist
Manual smoke test — not an automated regression gate. Rerun after major version upgrades or infrastructure changes. Results captured in ha-validation-report.md.
- Shared asset readability. Upload an asset via pod A, render a template referencing it via pod B. Expect: same asset bytes returned in the PDF.
- Schedule fires exactly once. Configure a cron schedule; start 2+ pods; wait for one tick. Query the schedule_executions table — expect exactly one row per scheduled tick (not one per pod).
- Editor token cross-pod. Mint an editor token via pod A (POST /editor-token), then submit a template mutation via pod B with that token. Expect: 200 + an audit row attributed to the minter.
- Graceful degradation. Kill one pod mid-request. Expect: other pods continue to serve; the LB routes around the failed pod.
- Tenant archive propagation. In multi-tenant mode, archive a tenant via pod A. Wait TENANT_STATUS_CACHE_TTL_MS. Expect: write attempts via pod B are rejected.
- Key rotation. Follow the API_KEY_*_PREVIOUS runbook. Expect: zero 401s during the rotation window when clients are switched one-by-one.
An automated harness for this checklist is tracked as a follow-up initiative.