
Pulp Engine — Operator Runbook

Operator reference. Steps in order. Run everything from the repo root unless noted.


Pre-deployment checklist

  • Node 22–24 — node --version
  • pnpm 10.32.1 — pnpm --version
  • jq installed — required by scripts/smoke-test.sh (jq --version)
  • .env created from .env.example with STORAGE_MODE set (or left unset for postgres default)
  • NODE_ENV=production set in .env
  • At least one scoped API credential set in .env:
    • API_KEY_ADMIN (required for template management and full access)
    • API_KEY_RENDER (optional — render-only integrations)
    • API_KEY_PREVIEW (optional — preview routes, when PREVIEW_ROUTES_ENABLED=true)
    • API_KEY_EDITOR (optional — visual editor; operators enter this value in the editor login form — no VITE_API_KEY needed)
    • Legacy API_KEY accepted during migration (treated as admin) but deprecated; cannot coexist with the new keys
  • Security hardening configured — enforced by default when NODE_ENV=production:
    • CORS_ALLOWED_ORIGINS — specific trusted origins (not *)
    • DOCS_ENABLED — explicitly set (false strongly recommended unless Swagger UI is needed)
    • METRICS_TOKEN — bearer token for GET /metrics (openssl rand -hex 32)
    • REQUIRE_HTTPS=true — rejects editor-token login over plain HTTP
    • TRUST_PROXY=true — required when behind a TLS-terminating reverse proxy
    • BLOCK_REMOTE_RESOURCES=true — prevents render pipeline from fetching arbitrary external resources
    • EDITOR_USERS_JSON configured for per-user identity, or ALLOW_SHARED_KEY_EDITOR=true to acknowledge shared-key mode
    • Startup fails with a combined error listing all violations if any are missing.
    • Evaluation posture: set HARDEN_PRODUCTION=false to temporarily disable enforcement while configuring controls.
  • Asset binary store configured (ASSET_BINARY_STORE — default is filesystem):
    • Filesystem mode (default): ASSETS_DIR set to an absolute path (e.g. /var/pulp-engine/assets) — directory is auto-created on startup. ASSETS_BASE_URL set if the default /assets does not match your reverse proxy configuration.
    • S3 mode: see S3 pre-flight checklist below.
  • Asset access mode configured (ASSET_ACCESS_MODE — default public):
    • Public mode (default): no additional config required. S3 bucket must be publicly readable (see S3 pre-flight).
    • Private mode: set ASSET_ACCESS_MODE=private. S3 bucket does NOT need public-read; add s3:GetObject to credentials. S3_PUBLIC_URL not required.
  • Named-user mode (if using EDITOR_USERS_JSON): verify each user’s id is URL-safe, key is unique and does not duplicate any API_KEY_* value, role is editor or admin. Startup exits immediately with a descriptive error if misconfigured.
  • On Linux: Chromium system libraries installed (see deployment-guide.md §1)
  • Preview route posture confirmed: PREVIEW_ROUTES_ENABLED absent (routes return 404 in production) or intentionally set to true with network restrictions in place

Postgres mode (STORAGE_MODE=postgres or unset):

  • DATABASE_URL set in .env and reachable — psql "$DATABASE_URL" -c "\conninfo"

SQL Server mode (STORAGE_MODE=sqlserver):

  • SQL_SERVER_URL set in .env and reachable

File mode (STORAGE_MODE=file):

  • TEMPLATES_DIR set in .env and the directory contains valid template JSON files
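Taken together, the checklist maps onto a .env along these lines. A sketch only — every value is a placeholder, and it assumes postgres mode with shared-key editor access explicitly acknowledged:

```
# Sketch .env for a hardened postgres deployment — all values are placeholders
NODE_ENV=production
STORAGE_MODE=postgres
DATABASE_URL=postgres://pulp:change-me@db:5432/pulp
API_KEY_ADMIN=change-me-admin-key
CORS_ALLOWED_ORIGINS=https://editor.example.com
DOCS_ENABLED=false
METRICS_TOKEN=change-me-metrics-token
REQUIRE_HTTPS=true
TRUST_PROXY=true
BLOCK_REMOTE_RESOURCES=true
ALLOW_SHARED_KEY_EDITOR=true
ASSETS_DIR=/var/pulp-engine/assets
```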

S3 asset binary storage pre-flight

Complete this checklist before starting the API with ASSET_BINARY_STORE=s3.

  • Bucket exists and is in the correct region.
  • Credentials (S3_ACCESS_KEY_ID / S3_SECRET_ACCESS_KEY) have object-write/delete access (s3:PutObject, s3:DeleteObject) and the bucket-level access required for the HeadBucket startup probe (see deployment-guide.md § Object Storage).
  • Access mode (ASSET_ACCESS_MODE):
    • Public mode (default): bucket and objects must be publicly readable at S3_PUBLIC_URL. Puppeteer fetches asset URLs without auth headers. S3_PUBLIC_URL required when using a custom endpoint or path-style.
    • Private mode: bucket does not need public-read. API credentials must have s3:GetObject in addition to s3:PutObject, s3:DeleteObject. S3_PUBLIC_URL not required.
  • S3_PUBLIC_URL set (public mode only) when using a custom endpoint (S3_ENDPOINT) or path-style (S3_PATH_STYLE=true). Verify the URL is reachable from Puppeteer’s perspective (same network as the API container).
  • CORS configured on the bucket if the editor (browser) loads images directly from S3_PUBLIC_URL (origin GET). Not required if images are only fetched server-side by Puppeteer.
  • Verify bucket access from the deployment host:
# Quick connectivity probe (requires AWS CLI or equivalent)
aws s3 ls s3://$S3_BUCKET --region $S3_REGION
  • API startup log shows: "Asset binary store: S3" with the correct bucket and region. GET /health/ready returns 200 with all checks "ok".
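If the editor does load images directly from the bucket, a minimal CORS policy can be applied with the AWS CLI. A sketch only — the origin is a placeholder for your editor's origin, and the apply step is shown commented because it needs live credentials with s3:PutBucketCORS:

```shell
# Write a minimal CORS policy allowing browser GETs from the editor origin.
# https://editor.example.com is a placeholder — substitute your editor's origin.
cat > /tmp/pulp-cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://editor.example.com"],
      "AllowedMethods": ["GET"],
      "MaxAgeSeconds": 3600
    }
  ]
}
EOF
# Apply it (credentials need s3:PutBucketCORS):
# aws s3api put-bucket-cors --bucket "$S3_BUCKET" --cors-configuration file:///tmp/pulp-cors.json
echo "wrote /tmp/pulp-cors.json"
```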

Env vars required (S3 mode):

  • ASSET_BINARY_STORE=s3
  • S3_BUCKET=my-pulp-engine-assets
  • S3_REGION=us-east-1
  • S3_ACCESS_KEY_ID=AKIA...
  • S3_SECRET_ACCESS_KEY=(secret)
  • S3_ENDPOINT=https://minio.example.com (custom providers only)
  • S3_PATH_STYLE=true (MinIO only)
  • S3_PUBLIC_URL=https://assets.example.com (required with custom endpoint or path-style)

Deployment steps

Run these in order. Each step must succeed before continuing.

Postgres mode (default)

# 1. Install dependencies
pnpm install

# 2. Generate Prisma client
pnpm db:generate

# 3. Apply all migrations to the database
pnpm --filter @pulp-engine/api db:deploy
# Already-applied migrations are skipped; safe to re-run

# 4. Load sample templates
pnpm db:seed
# Expected output: "loan-approval-letter@1.0.0 seeded" and "sample-invoice@1.0.0 seeded"

# 5. Build all packages
pnpm build

# 6. Start the API
node apps/api/dist/index.js
# Expected: JSON log line with "Pulp Engine API running on http://..."

File mode

# 1. Install dependencies
pnpm install

# 2. Generate Prisma client (compiles types only; no DB connection made)
pnpm db:generate

# 3. Build all packages
pnpm build

# 4. Start the API
node apps/api/dist/index.js
# Expected: JSON log line with "Pulp Engine API running on http://..."

No migration or seed step required — the API reads templates directly from TEMPLATES_DIR on startup.
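Before starting in file mode, a quick sanity pass over TEMPLATES_DIR catches malformed files early. A sketch only — it checks JSON parseability with jq (already required by the smoke tests), not the template schema, which the API validates itself:

```shell
# Check that every .json file in TEMPLATES_DIR parses; jq's `empty` filter
# consumes the input and exits non-zero on invalid JSON.
TEMPLATES_DIR="${TEMPLATES_DIR:-/var/pulp-engine/templates}"
bad=0
for f in "$TEMPLATES_DIR"/*.json; do
  [ -e "$f" ] || { echo "no .json files found in $TEMPLATES_DIR"; break; }
  if ! jq empty "$f" >/dev/null 2>&1; then
    echo "invalid JSON: $f"
    bad=1
  fi
done
[ "$bad" -eq 0 ] && echo "template JSON check passed"
```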

SQL Server mode

# 1. Install dependencies
pnpm install

# 2. Generate Prisma client (compiles types only; no DB connection made)
pnpm db:generate

# 3. Apply SQL Server schema
pnpm --filter @pulp-engine/api db:migrate:sqlserver
# Creates the database if absent; idempotent — safe to re-run

# 4. Load sample templates
pnpm db:seed
# Expected output: "loan-approval-letter@1.0.0 seeded" and "sample-invoice@1.0.0 seeded"

# 5. Build all packages
pnpm build

# 6. Start the API
node apps/api/dist/index.js
# Expected: JSON log line with "Pulp Engine API running on http://..."

Register the process with your process manager after confirming step 6 works manually:

# PM2
pm2 start apps/api/dist/index.js --name pulp-engine-api
pm2 save
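If you run systemd instead of PM2, a minimal unit along these lines works. A sketch — the unit path, user, working directory, and .env location are assumptions to adapt:

```ini
# /etc/systemd/system/pulp-engine-api.service — paths and user are assumptions
[Unit]
Description=Pulp Engine API
After=network-online.target

[Service]
User=pulp
WorkingDirectory=/opt/pulp-engine
EnvironmentFile=/opt/pulp-engine/.env
ExecStart=/usr/bin/node apps/api/dist/index.js
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable with systemctl daemon-reload && systemctl enable --now pulp-engine-api.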

Migrating from file mode to a database backend

One-time migration when promoting a deployment from STORAGE_MODE=file to postgres or SQL Server.

Stop the API first. Existing records in the target are skipped, not updated, so migrate into an empty target database for a clean result.

# 1. Apply the target schema (postgres example)
pnpm --filter @pulp-engine/api db:deploy
# SQL Server: pnpm --filter @pulp-engine/api db:migrate:sqlserver

# 2. Dry run — verify the startup lines show the correct paths and storage mode
STORAGE_MODE=postgres \
  TEMPLATES_DIR=/var/pulp-engine/templates \
  ASSETS_DIR=/var/pulp-engine/assets \
  DATABASE_URL="$DATABASE_URL" \
  pnpm --filter @pulp-engine/api db:migrate:file-to-db -- --dry-run

# 3. Run the migration
STORAGE_MODE=postgres \
  TEMPLATES_DIR=/var/pulp-engine/templates \
  ASSETS_DIR=/var/pulp-engine/assets \
  DATABASE_URL="$DATABASE_URL" \
  pnpm --filter @pulp-engine/api db:migrate:file-to-db
# Exit 0 = success; Exit 2 = partial (review warnings); Exit 1 = fatal

# 4. Update .env: set STORAGE_MODE=postgres; restart the API

Asset binaries are not moved by the script — ensure ASSETS_DIR is the same path in the target deployment, or copy binary files there first.

See deployment-guide.md §10 for full details, source-data error policy, and known limitations.
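The exit-code contract above can be wired into a deployment script. A sketch — run_migration is a stand-in for the real pnpm command so the branching is visible:

```shell
# Gate a pipeline on the migration's documented exit codes (0 ok, 2 partial, else fatal).
run_migration() {
  # Stand-in for: pnpm --filter @pulp-engine/api db:migrate:file-to-db
  return 0
}
run_migration; rc=$?
case "$rc" in
  0) mig_status=ok ;;        # proceed: flip STORAGE_MODE and restart
  2) mig_status=partial ;;   # review warnings before switching modes
  *) mig_status=fatal ;;     # abort the deployment
esac
echo "migration exit $rc => $mig_status"
```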


Smoke tests after deployment

Run the validation script immediately after the process starts:

# Runs liveness, readiness, metrics, auth, and (optionally) render checks
./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN loan-approval-letter

# Without a template key (skips render check — useful for fresh deployments pre-seed)
./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN

# Docker image deployment — also verify the bundled editor SPA is being served
EXPECT_EDITOR=true ./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN

The script exits 0 on success and 1 on any failure. Run it as part of your deployment pipeline or CI gate.

Bundled editor check (Docker image deployments)

When deploying the Docker image, verify the full editor path end-to-end:

# 1. Verify the editor SPA is served
curl -I http://localhost:3000/editor/
# Expected: HTTP/1.1 200 OK, Content-Type: text/html

# 2. Verify /editor redirects to /editor/
curl -I http://localhost:3000/editor
# Expected: HTTP/1.1 301 Moved Permanently, Location: /editor/

Then verify the editor can reach the API in a browser:

  1. Open http://[host]:3000/editor/ — the login screen should load (not an error or blank page)
  2. Enter API_KEY_EDITOR — the editor should load and /templates should be reachable
  3. If PREVIEW_ROUTES_ENABLED=true is set: open a template and click the preview button — it should render

Or use the validate script with EXPECT_EDITOR=true (checks 1–2 above automatically):

EXPECT_EDITOR=true ./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN

For live preview to work: PREVIEW_ROUTES_ENABLED=true must be set. The evaluator compose files set this automatically. See deployment-guide.md § Visual Editor for production guidance.

Detailed manual checks follow for diagnosis and additional coverage:

1. Health checks

# Liveness
curl -s http://localhost:3000/health
# Expected: { "status": "ok", "version": "0.51.0", "timestamp": "2026-..." }

# Readiness (verifies storage, asset binary store, and renderer are reachable)
curl -s http://localhost:3000/health/ready
# Expected: { "status": "ok", "version": "0.51.0", "timestamp": "2026-...", "checks": { "storage": "ok", "assetBinaryStore": "ok", "renderer": "ok" } }

A 503 from /health/ready means at least one subsystem check returned "error" or "timeout":

  • storage — check template store connectivity (database or file system)
  • assetBinaryStore — check the binary asset store (file system or S3)
  • renderer — check the Chromium browser process or render dispatcher

Any single failing check causes a 503. In API-only mode (no render dispatcher, preview disabled), the renderer check always reports "ok".
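When diagnosing a 503, it helps to list only the failing checks. A sketch using jq; the sample body is illustrative — in practice, pipe the output of curl -s http://localhost:3000/health/ready in instead:

```shell
# Extract subsystem checks that are not "ok" from a readiness payload.
body='{"status":"error","checks":{"storage":"ok","assetBinaryStore":"timeout","renderer":"ok"}}'
failing=$(printf '%s' "$body" | jq -r '.checks | to_entries[] | select(.value != "ok") | "\(.key)=\(.value)"')
echo "$failing"
# → assetBinaryStore=timeout
```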

# Metrics scrape (Prometheus format)
curl -s http://localhost:3000/metrics | head -20
# Expected: lines starting with # HELP and process_cpu_seconds_total

2. List templates

curl -s http://localhost:3000/templates \
  -H "X-Api-Key: $API_KEY_ADMIN"

Expected: a paginated envelope { "items": [...], "total": N, "limit": 50, "offset": 0 } with items containing at least two entries — loan-approval-letter and sample-invoice. If items is empty: postgres or sqlserver mode → re-run pnpm db:seed; file mode → verify TEMPLATES_DIR is set correctly and contains valid JSON files.

3. HTML render (fast — no Puppeteer)

curl -s -X POST http://localhost:3000/render/html \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_RENDER" \
  -d '{
    "template": "loan-approval-letter",
    "data": {
      "applicantName": "Smoke Test",
      "loanAmount": 10000,
      "interestRate": 5.0,
      "termMonths": 12,
      "requiresGuarantor": false,
      "items": []
    }
  }' | head -c 100

Expected: starts with <!DOCTYPE html>. Any 4xx or 5xx — check logs.

3b. CSV export (no Puppeteer)

curl -s -X POST http://localhost:3000/render/csv \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_RENDER" \
  -d '{
    "template": "loan-approval-letter",
    "data": {
      "applicantName": "Smoke Test",
      "loanAmount": 10000,
      "interestRate": 5.0,
      "termMonths": 12,
      "requiresGuarantor": false,
      "items": [{ "description": "Test", "amount": 100 }]
    }
  }' | head -c 200

Expected: CSV header row + data rows. 422 with no_rendered_tables means the template has no table nodes. Any 5xx — check logs.

4. Asset management

curl -s http://localhost:3000/assets \
  -H "X-Api-Key: $API_KEY_ADMIN"

Expected: a paginated envelope { "items": [...], "total": 0, "limit": 50, "offset": 0 } (empty items array is fine on a fresh deployment — no assets have been uploaded yet). A 4xx or 5xx response indicates a routing or startup problem.

5. PDF render (end-to-end)

curl -s -X POST http://localhost:3000/render \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_RENDER" \
  -d '{
    "template": "loan-approval-letter",
    "data": {
      "applicantName": "Smoke Test",
      "loanAmount": 10000,
      "interestRate": 5.0,
      "termMonths": 12,
      "requiresGuarantor": false,
      "items": []
    }
  }' --output /tmp/smoke.pdf && head -c 4 /tmp/smoke.pdf

Expected output: %PDF. This also warms up the Puppeteer browser singleton (first call takes ~2–3 s; subsequent calls are faster).

6. Confirm preview route gating (production only)

The key distinction: disabled preview routes return 404; enabled preview routes are registered and respond to the request (returning a validation error for invalid input, not 404).

If PREVIEW_ROUTES_ENABLED is not set (default — routes are disabled):

curl -s -o /dev/null -w "%{http_code}" -X POST \
  http://localhost:3000/render/preview/html \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -d '{"template":{},"data":{}}'
# Expected: 404

If PREVIEW_ROUTES_ENABLED=true (routes are enabled): the route is registered — an invalid body triggers template validation rather than returning 404.

STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
  http://localhost:3000/render/preview/html \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -d '{"template":{},"data":{}}')
[ "$STATUS" != "404" ] && echo "OK — route registered ($STATUS)" || echo "FAIL — route not found"

Verify production logging

curl -s http://localhost:3000/health > /dev/null

Check the process output (or log file if piped). You should see a JSON line like:

{"level":30,"time":1234567890,"reqId":"req-1","req":{"method":"GET","url":"/health"},"res":{"statusCode":200},"responseTime":3.2,"msg":"request completed"}

Key fields to confirm are present: level, time, reqId, res.statusCode, responseTime.

If you see pretty-printed output instead of JSON — confirm NODE_ENV=production is set in .env and the process was restarted after the change.

If you see no log output at all for requests — confirm level is info in the server config (this was fixed in the pre-deployment pass; rebuild if on an older artifact).


Audit log events (v0.20.0+)

Three structured log event types are emitted for operator accountability. All include actor (operator-supplied actor label or null) and credentialScope.

editor_token_minted

Emitted on every successful POST /auth/editor-token.

  • event — editor_token_minted
  • keyScope — scope of the key used to mint the token (admin or editor)
  • issuedAt — ISO-8601 timestamp
  • expiresAt — ISO-8601 timestamp
  • actor — operator-supplied actor label, or null if none was supplied

template_mutation

Emitted on every successful template write: POST /templates (create), PUT /templates/:key (update), DELETE /templates/:key (delete), POST /templates/:key/versions/:version/restore.

  • event — template_mutation
  • operation — create, update, delete, or restore
  • templateKey — the template key
  • credentialScope — admin or editor
  • actor — operator-supplied actor label, or null

asset_mutation

Emitted on every successful asset write: POST /assets/upload and DELETE /assets/:id.

  • event — asset_mutation
  • operation — upload or delete
  • assetId — the asset UUID
  • credentialScope — admin or editor
  • actor — operator-supplied actor label, or null

actor: null means the write was performed via direct X-Api-Key auth, or no actor label was supplied at login. Raw API key values and token strings are never included in log payloads.

Queryable audit endpoint

In addition to structured logs, all three event types are persisted to the database and queryable via GET /audit-events (admin scope required). See the API guide for filter parameters and response format.

# Example: all mutations by a specific actor in the last 7 days
curl -s "http://localhost:3000/audit-events?actor=alice&since=$(date -u -d '-7 days' +%Y-%m-%dT%H:%M:%SZ)" \
  -H "X-Api-Key: $API_KEY_ADMIN"

Audit events are stored in the same database as templates and assets.

Purging old events: Use DELETE /audit-events?before=<ISO 8601> (admin scope) to remove events older than a given timestamp. The endpoint returns { "deleted": N }.

# Example: purge events older than 90 days
CUTOFF=$(date -u -d '-90 days' +%Y-%m-%dT%H:%M:%SZ)
curl -s -X DELETE "http://localhost:3000/audit-events?before=$CUTOFF" \
  -H "X-Api-Key: $API_KEY_ADMIN"

For automated retention, schedule a cron job or Kubernetes CronJob that calls this endpoint periodically (e.g. nightly with a 90-day cutoff). A common convention is to record the chosen window as AUDIT_RETENTION_DAYS=90 in the deployment’s environment, so retention scripts and operators read it from one place.
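One way to package the retention call, as a sketch — the script path and schedule are assumptions, and the DELETE is shown commented because it needs a running API:

```shell
# Nightly purge sketch. AUDIT_RETENTION_DAYS and API_KEY_ADMIN come from the job environment.
RETENTION_DAYS="${AUDIT_RETENTION_DAYS:-90}"
CUTOFF=$(date -u -d "-${RETENTION_DAYS} days" +%Y-%m-%dT%H:%M:%SZ)
echo "purging audit events before $CUTOFF"
# curl -s -X DELETE "http://localhost:3000/audit-events?before=$CUTOFF" \
#   -H "X-Api-Key: $API_KEY_ADMIN"
```

Installed as, say, /usr/local/bin/pulp-audit-purge.sh (a hypothetical path), a crontab entry such as 15 3 * * * /usr/local/bin/pulp-audit-purge.sh runs it nightly.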


Request correlation with X-Request-ID (v0.54.0+)

Every API response includes an X-Request-ID header containing a server-generated UUID. The same value appears as reqId in all structured log entries for that request.

Correlating a client error with server logs:

# 1. Extract the request ID from the response header
curl -s -D - http://localhost:3000/templates \
  -H "X-Api-Key: $API_KEY_ADMIN" 2>&1 | grep -i x-request-id
# X-Request-ID: 3bcc2c16-228b-4b09-8181-347201942b11

# 2. Search structured logs for that request
cat logs/api.json | jq 'select(.reqId == "3bcc2c16-228b-4b09-8181-347201942b11")'

The request ID is always server-generated and cannot be overridden by clients. Reverse proxies should forward (not strip) the X-Request-ID response header to downstream clients.


Verify templates

# List all templates (admin or editor key)
curl -s http://localhost:3000/templates \
  -H "X-Api-Key: $API_KEY_ADMIN" | jq '.items[].key'
# Expected: "loan-approval-letter", "sample-invoice"

# Get a sample payload (admin or editor key)
curl -s http://localhost:3000/templates/loan-approval-letter/sample \
  -H "X-Api-Key: $API_KEY_ADMIN"

# Validate a payload without rendering (editor or admin key)
curl -s -X POST http://localhost:3000/templates/loan-approval-letter/validate \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -d "$(curl -s -H "X-Api-Key: $API_KEY_ADMIN" http://localhost:3000/templates/loan-approval-letter/sample)"
# Expected: { "valid": true, "issues": [] }

What to check if the API fails to start with HARDEN_PRODUCTION=true

When HARDEN_PRODUCTION=true, the API exits immediately with a combined error listing all violations. Example:

❌ HARDEN_PRODUCTION=true but required security controls are not configured:
   • CORS_ALLOWED_ORIGINS must be set to a comma-separated list of specific trusted origins ...
   • DOCS_ENABLED must be explicitly set. Use DOCS_ENABLED=false to disable the Swagger UI ...
   • METRICS_TOKEN must be set to protect GET /metrics with bearer authentication ...
Configure all required controls or unset HARDEN_PRODUCTION to disable enforcement.

Resolution — configure each listed control:

  • CORS_ALLOWED_ORIGINS violation — set to comma-separated specific origins, e.g. CORS_ALLOWED_ORIGINS=https://editor.example.com. Wildcard * is not accepted in hardened mode.
  • DOCS_ENABLED violation — explicitly set DOCS_ENABLED=false (recommended) or DOCS_ENABLED=true to acknowledge exposure. Leaving it unset (defaulting) is rejected.
  • METRICS_TOKEN violation — generate and set a token: METRICS_TOKEN=$(openssl rand -hex 32). Pass the same token to validate-deploy.sh as the 4th argument.
  • REQUIRE_HTTPS violation — set REQUIRE_HTTPS=true. Also set TRUST_PROXY=true (see below).
  • TRUST_PROXY violation — set TRUST_PROXY=true. Required when REQUIRE_HTTPS=true so Fastify can read X-Forwarded-Proto behind a TLS-terminating reverse proxy. Safe for direct-TLS deployments too.
  • BLOCK_REMOTE_RESOURCES violation — set BLOCK_REMOTE_RESOURCES=true to prevent the render pipeline from fetching resources from arbitrary public hosts during PDF generation. Optionally set ALLOWED_REMOTE_ORIGINS for trusted font/image CDNs.
  • Named-user registry violation — when editor login is capable (any of API_KEY_EDITOR, API_KEY_ADMIN, or API_KEY set): configure EDITOR_USERS_JSON for per-user identity (recommended), or set ALLOW_SHARED_KEY_EDITOR=true to explicitly accept shared-key identity.

All seven controls must be in place before restarting the API with HARDEN_PRODUCTION=true.


What to check if the editor login fails

Login gate always shows / token invalid

  • Login form appears even with correct key — confirm API_KEY_EDITOR is set on the API server; restart the API after changing it.
  • "Editor login is not configured" config-error card — API_KEY_EDITOR (or API_KEY_ADMIN) is not set; only render/preview keys are present. Set API_KEY_EDITOR.
  • "Invalid key" error after entering the correct value — the key was entered with leading/trailing whitespace, or API_KEY_EDITOR was changed since the editor was last used; re-enter the correct value.
  • Login succeeds but editor shows 401 immediately — the server-side API_KEY_EDITOR was rotated after the token was issued; all outstanding tokens are invalidated. Log in again with the new key.
  • Login gate blocks after API_KEY_EDITOR rotation — expected; the token was signed with the old key. Log in again with the new API_KEY_EDITOR value.
  • Session expires mid-session — token TTL is controlled by EDITOR_TOKEN_TTL_MINUTES (default 8 hours). After expiry the editor automatically returns to the login form; re-enter the key to continue.
  • Session token appears valid but 401 despite no rotation — EDITOR_TOKEN_ISSUED_AFTER may be set to a time after the token was minted. Tokens with an issued-at before this threshold are rejected even if not expired.

Verify the auth endpoints are reachable

# Should return 200 with authRequired and editorLoginAvailable fields
curl -s http://localhost:3000/auth/status
# Expected: {"authRequired":true,"editorLoginAvailable":true}

# Should return 200 with a token (replace <key> with API_KEY_EDITOR value)
curl -s -X POST http://localhost:3000/auth/editor-token \
  -H "Content-Type: application/json" \
  -d '{"key":"<key>"}'
# Expected: {"token":"...","expiresAt":"..."}

If authRequired is true but editorLoginAvailable is false: only render/preview keys are configured; set API_KEY_EDITOR or API_KEY_ADMIN and restart the API.

HTTPS reminder: POST /auth/editor-token transmits API_KEY_EDITOR over the network. In production, ensure the API is served behind HTTPS (TLS-terminating reverse proxy). On plain HTTP, a network observer can capture the key at login time.

Invalidate outstanding editor sessions without key rotation (v0.19.0+)

If you suspect a session token has been compromised but do not want to rotate API_KEY_EDITOR (which would also disrupt other integrations using that key directly), use the issued-after guard:

  1. Note the current UTC time: date -u +"%Y-%m-%dT%H:%M:%SZ"
  2. Set EDITOR_TOKEN_ISSUED_AFTER=<timestamp> in your environment (e.g. 2026-03-24T14:00:00Z).
  3. Restart the API.

All editor tokens with an issued-at timestamp before the configured value will be rejected with 401. Users will see the login form and must mint a fresh token. Tokens issued after the threshold are unaffected.

Requirements and caveats:

  • The value must be a UTC ISO-8601 datetime string with an explicit offset (e.g. Z or +00:00). An invalid format causes the API to exit at startup.
  • The guard takes effect only after a restart — it is pre-computed from config, not re-read per request.
  • Pre-v0.19.0 tokens have no issued-at claim and are treated as iat=0 — they are always rejected when this guard is set. This is intentional: if you are running a mixed deployment, outstanding old-format tokens will be invalidated on upgrade when the guard is active.
  • In multi-instance deployments, server clocks must be reasonably synchronised (NTP); a large clock skew between instances means the guard may fire at slightly different times across nodes.
  • To disable the guard, unset EDITOR_TOKEN_ISSUED_AFTER and restart.
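The three steps can be scripted; a sketch (the .env path and restart command are assumptions for your deployment):

```shell
# Mint a guard timestamp in the exact format the API accepts (UTC, explicit Z offset).
GUARD=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
echo "EDITOR_TOKEN_ISSUED_AFTER=$GUARD"
# Append to the deployment env and restart (path is an assumption):
# echo "EDITOR_TOKEN_ISSUED_AFTER=$GUARD" >> /opt/pulp-engine/.env
# systemctl restart pulp-engine-api   # or your process manager's restart
```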

Auth secret rotation

How auth secrets work

All auth secrets are loaded once from environment variables at process startup. There is no hot reload — every rotation or guard change requires a process restart. The key structures are:

  • credentials map — built from API_KEY_ADMIN, API_KEY_EDITOR, API_KEY_RENDER, API_KEY_PREVIEW. Controls which keys are accepted as X-Api-Key and which keys can mint editor session tokens via POST /auth/editor-token.
  • editorCapableSecrets array — built from the admin and editor keys (plus any configured previous keys). Used by verifyEditorToken to validate X-Editor-Token headers. Supports multiple candidate secrets simultaneously, which enables the rollover window described in Procedures D and E below.
  • notBefore guard — pre-computed from EDITOR_TOKEN_ISSUED_AFTER. Rejects tokens with an issued-at timestamp before the configured value. See the “Invalidate outstanding editor sessions without key rotation” section above for the standalone procedure.

Clock synchronisation: In multi-instance deployments, server clocks must be NTP-synchronised. The notBefore guard and session token expiry checks are time-based; large clock skew between instances means the guard fires at inconsistent times across the fleet.


Procedure A — Invalidate sessions only (no key change)

If you need to force all active editor sessions to log in again without rotating API_KEY_EDITOR, use the EDITOR_TOKEN_ISSUED_AFTER guard described in the section above (“Invalidate outstanding editor sessions without key rotation”).


Procedure B — Rotate API_KEY_RENDER or API_KEY_PREVIEW

These keys have no session tokens. Near-zero-downtime rotation is not supported for them because there is no previous-key mechanism for render/preview keys.

Option 1 — Coordinated cutover (recommended):

  1. Generate new secret: openssl rand -base64 32
  2. Update API_KEY_RENDER (or API_KEY_PREVIEW) on all instances simultaneously.
  3. Restart all instances together.
  4. Callers must switch to the new key.

No mixed-auth window; brief downtime during restart.

Option 2 — Rolling rollout (accept temporary inconsistency):

  1. Update and restart instances one at a time.
  2. Restarted instances accept only the new key; pending instances accept only the old key.
  3. During the rollout, callers see intermittent 401 responses regardless of which key they use, because different instances disagree.

Use Option 2 only if temporary auth inconsistency is acceptable for these endpoints.


Procedure C — Rotate API_KEY_EDITOR with brief downtime

Use when session downtime during the restart window is acceptable (no active editor sessions, or coordinated with users).

Single instance:

  1. Generate new secret: openssl rand -base64 32
  2. Update API_KEY_EDITOR in .env.
  3. Restart. All tokens signed with the old key immediately fail HMAC verification. Users see 401 and must log in again with the new key.
  4. Validate: ./scripts/validate-deploy.sh

Multi-instance — coordinated restart:

  1. Update API_KEY_EDITOR on all instances simultaneously.
  2. Restart all instances. During the brief window when different instances hold different keys, tokens minted against old-key instances fail on new-key instances and vice versa.
  3. Once all instances are restarted with the new key, all pre-rotation sessions are invalidated.
  4. Keep the restart window short; perform during low-traffic periods.

Procedure D — Rotate API_KEY_EDITOR near-zero-downtime

Use API_KEY_EDITOR_PREVIOUS to allow tokens signed with the old key to continue verifying through the rollover window. This preserves existing editor sessions — it does not preserve direct X-Api-Key usage of the old key. Callers using the old key directly must coordinate a switch to the new key during the rollout.

Single instance:

  1. Generate new secret: openssl rand -base64 32
  2. Set API_KEY_EDITOR=<new> and API_KEY_EDITOR_PREVIOUS=<old> in .env.
  3. Restart. The startup log will emit a rollover warning — this is expected.
  4. Behaviour: new tokens are minted with the new key. Existing tokens signed with the old key continue to verify via API_KEY_EDITOR_PREVIOUS.
  5. Wait for the rollover window to close. Maximum wait is EDITOR_TOKEN_TTL_MINUTES (default 8 hours) after the restart.
  6. Remove API_KEY_EDITOR_PREVIOUS from .env and restart again.
  7. Validate: ./scripts/validate-deploy.sh

Multi-instance — rolling rotation (the primary use case):

  1. Generate new secret.
  2. For each instance in turn:
    • Set API_KEY_EDITOR=<new> and API_KEY_EDITOR_PREVIOUS=<old> on that instance.
    • Restart that instance.
    • Validate with ./scripts/validate-deploy.sh against that instance before continuing.
  3. After all instances are updated: every instance accepts both old-key tokens (via previous) and new-key tokens. Rolling restarts are safe for editor sessions.
  4. Wait for EDITOR_TOKEN_TTL_MINUTES to elapse since the first instance was restarted.
  5. Remove API_KEY_EDITOR_PREVIOUS from all instances and perform a second rolling restart.
  6. Final validation: ./scripts/validate-deploy.sh

Key invariants:

  • API_KEY_EDITOR_PREVIOUS must not equal API_KEY_EDITOR, API_KEY_ADMIN, API_KEY_RENDER, or API_KEY_PREVIEW. The server rejects the combination at startup.
  • The previous key cannot be submitted to POST /auth/editor-token (returns 401).
  • The previous key cannot be used as X-Api-Key (returns 401).
  • The rollover window is at most EDITOR_TOKEN_TTL_MINUTES. After that window, all old-key tokens have naturally expired and API_KEY_EDITOR_PREVIOUS can be removed.
  • Do not leave the previous key set indefinitely — it represents a second verifiable secret.
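Since the server rejects a colliding previous key only at restart, checking the values you are about to deploy can save a failed rollout. A sketch with placeholder values — in practice, read the variables from your .env:

```shell
# Verify the previous key does not duplicate any active key before restarting.
API_KEY_EDITOR="new-editor-secret"          # placeholders
API_KEY_EDITOR_PREVIOUS="old-editor-secret"
API_KEY_ADMIN="admin-secret"
API_KEY_RENDER=""
API_KEY_PREVIEW=""
collision=0
for k in "$API_KEY_EDITOR" "$API_KEY_ADMIN" "$API_KEY_RENDER" "$API_KEY_PREVIEW"; do
  [ -n "$k" ] && [ "$k" = "$API_KEY_EDITOR_PREVIOUS" ] && collision=1
done
[ "$collision" -eq 0 ] && echo "no key collision" || echo "collision: pick a distinct previous key"
```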

Procedure E — Rotate API_KEY_ADMIN

Identical to Procedure D but using API_KEY_ADMIN and API_KEY_ADMIN_PREVIOUS. Note that API_KEY_ADMIN can mint editor session tokens (in addition to its full admin scope), so the same TTL window reasoning applies.

Near-zero-downtime applies to existing editor sessions signed with the old admin key. It does not preserve direct X-Api-Key usage of the old API_KEY_ADMIN — callers using the admin key directly as X-Api-Key must coordinate a switch to the new key.


Caveats and checklist

  • Restart is always required. No env var change takes effect without restarting the process.
  • Verify the rollover warning in the startup log. When API_KEY_EDITOR_PREVIOUS or API_KEY_ADMIN_PREVIOUS is set, the server emits a warn-level log at startup confirming rollover mode is active. If you do not see this log, the previous key was not loaded.
  • Previous key TTL window. The safe time to remove API_KEY_EDITOR_PREVIOUS is ≥ EDITOR_TOKEN_TTL_MINUTES after the first instance was restarted with the new key. Removing it earlier may invalidate sessions on instances that have not yet restarted.
  • Do not leave previous keys in place indefinitely. Remove them once the rollover window is closed.
  • NTP sync required for EDITOR_TOKEN_ISSUED_AFTER. If you use the issued-after guard in a multi-instance deployment, server clocks must be NTP-synchronised. Large skew means the guard fires at inconsistent times across the fleet.
  • Multi-instance deployments require identical env at steady state. All instances must have the same active keys once the rollout is complete.

What to check if PDF rendering fails

1. Check the request log for the error.

Look for a log line where res.statusCode is 500 and follow the reqId. An err field will be present on the same or adjacent line:

{"level":50,"reqId":"req-4","err":{"message":"...","stack":"..."},"msg":"..."}

2. Common failure causes:

Symptom → Check

• err.message contains Could not find Chromium → Puppeteer install incomplete — re-run pnpm install
• err.message contains error while loading shared libraries → Missing Linux system libraries — see deployment-guide.md §1
• err.message contains Navigation timeout → Puppeteer setContent timed out (30 s limit) — template HTML may be too large or contain blocking resources
• err.message contains Target closed → Browser singleton crashed — restart the API process; the browser will re-launch on the next request
• err.message contains Cannot find template or 404 → Postgres or SQL Server mode: template not seeded — re-run pnpm db:seed or create/import templates via the API. File mode: verify the TEMPLATES_DIR path and that the target JSON file is valid.
• High memory usage before failure → PDF buffered in memory; large document — reduce template complexity or increase server RAM
• PDF requests queue and don’t respond immediately under load → Expected behaviour — the concurrency limiter allows at most 5 simultaneous Chrome pages. Requests beyond that wait in FIFO order and are served as slots free. If the queue never drains, check for hung pages by restarting the API.
• Batch requests (POST /render/batch) are slow → Each batch processes up to BATCH_CONCURRENCY items in parallel (default: 5). A 50-item batch runs 10 sequential waves. Reduce BATCH_MAX_ITEMS or increase BATCH_CONCURRENCY (but keep it ≤ MAX_CONCURRENT_PAGES to avoid starving single renders). Monitor pulp_engine_renders_total{type="batch-pdf"} for throughput.
• Server log shows ERR_STREAM_PREMATURE_CLOSE at info level → Normal — a client disconnected mid-stream. The error handler suppresses the cascade. No action needed; the page slot is released automatically.
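The 50-item / 10-wave figure above is just ceiling division; a quick sketch:

```shell
# Sequential waves for an async batch = ceil(items / BATCH_CONCURRENCY).
ITEMS=50
BATCH_CONCURRENCY=5
WAVES=$(( (ITEMS + BATCH_CONCURRENCY - 1) / BATCH_CONCURRENCY ))
echo "waves: $WAVES"
```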

3. Isolate with an HTML render first.

If POST /render/html succeeds but POST /render fails, the problem is in Puppeteer, not in the template or data pipeline.


Asset upload validation

Asset uploads are validated server-side at the store layer before any binary is written.

Accepted formats: PNG, JPEG, GIF, WebP. All other types — including SVG — are rejected.

Two-stage validation:

  1. Allowlist check — the declared MIME type must be one of the four accepted types. SVG (image/svg+xml) is rejected with an explicit error message citing script-injection risk.
  2. Magic-bytes check — the file’s actual content is inspected (first 4–12 bytes) and compared against the declared type. If they do not match, the upload is rejected even if the MIME type would otherwise be allowed.
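The magic-bytes idea can be demonstrated standalone; a sketch that checks the 8-byte PNG signature against a generated file (illustrative only, not the store's actual code):

```shell
# PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
# Write a fake file with a valid PNG header (octal escapes: \211 = 0x89, \032 = 0x1a).
printf '\211PNG\r\n\032\n....' > /tmp/sample.png
SIG=$(head -c 8 /tmp/sample.png | od -An -tx1 | tr -d ' \n')
if [ "$SIG" = "89504e470d0a1a0a" ]; then
  echo "magic bytes match image/png"
else
  echo "mismatch: declared image/png, content says otherwise"
fi
```

The server performs the equivalent comparison for all four accepted formats before writing any binary.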

HTTP 415 Unsupported Media Type is returned for all of these failure cases:

Case → Example

• Declared type not in allowlist → image/bmp, application/javascript
• SVG declared → image/svg+xml
• Content does not match declared type → JPEG file submitted with Content-Type: image/png
• File content unrecognized → Renamed script or binary with an image MIME type
• File too short to detect → Fewer than 4 bytes

MIME normalization: The declared MIME type is normalized (trimmed, lowercased, parameters stripped) before validation. image/PNG; charset=binary is treated as image/png. The normalized value is what is stored in metadata.
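The same normalization can be mimicked in shell; a sketch, not the server's implementation:

```shell
# Normalize a declared MIME type: strip parameters, trim whitespace, lowercase.
DECLARED='  image/PNG; charset=binary '
NORMALIZED=$(printf '%s' "$DECLARED" | cut -d';' -f1 | tr -d ' ' | tr 'A-Z' 'a-z')
echo "$NORMALIZED"
```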

Existing SVG assets (residual risk): SVG assets uploaded before v0.27.0 are not automatically migrated or removed. They continue to be served by all four serve paths:

  • Private-mode proxy (GET /assets/:filename): content type derived from file extension — existing .svg files served as image/svg+xml.
  • Private-mode inline rendering: MIME type derived from file extension for base64 data URIs — existing .svg files inlined as image/svg+xml data URIs.
  • Public-mode filesystem: @fastify/static serves by extension — unchanged.
  • Public-mode S3: files stored in S3 with the ContentType set at upload time — unchanged.

The API server logs a legacy_svg_detected warning at startup if assets matching either detection signal are present (declared mimeType: image/svg+xml or filename ending in .svg). This warning repeats on every restart until the assets are removed.

Remediation workflow:

  1. Enumerate legacy SVG candidates (admin credentials required):

    GET /assets?legacySvg=true

    Returns all assets matched by declared mimeType image/svg+xml or by .svg filename extension. This covers both correctly declared SVGs and extension-only mismatches from the pre-v0.27.0 MIME-trust era.

  2. Identify template references for each returned asset. Check template definitions or run a test render — templates that reference the SVG asset will break if it is deleted before replacement.

  3. Upload a raster replacement: POST /assets/upload with a PNG or WebP version of the image.

  4. Update template definitions to reference the new raster asset URL instead of the SVG.

  5. Delete the legacy SVG: DELETE /assets/:id (admin credentials required). Only do this after templates have been updated.

  6. Confirm remediation: restart the server — the legacy_svg_detected warning will not appear once all matching assets are removed.
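The listing from step 1 can be reduced to an id list for scripting the later steps; a sketch assuming a hypothetical response shape with a top-level assets array (the real field names may differ):

```shell
# Hypothetical response body from GET /assets?legacySvg=true (shape assumed).
RESPONSE='{"assets":[{"id":"a1","filename":"logo.svg","mimeType":"image/svg+xml"},{"id":"a2","filename":"seal.svg","mimeType":"image/png"}]}'
# Extract the ids to feed into the template-reference check and DELETE steps.
echo "$RESPONSE" | jq -r '.assets[].id'
```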

Note: Assets whose SVG content was mislabeled at upload time (e.g. stored as image/png with a non-.svg filename) are not detectable without binary inspection of every stored file. The workflow above covers the common case of correctly declared and extension-identified legacy SVGs.
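When filesystem access to ASSETS_DIR is available, a coarse binary sweep can surface that mislabeled case; a heuristic sketch (demo directory and file are hypothetical, and XML prologs and whitespace vary in real SVGs):

```shell
# Flag files whose first 256 bytes contain "<svg", regardless of extension.
ASSETS_DIR="${ASSETS_DIR:-/tmp/assets-demo}"
mkdir -p "$ASSETS_DIR"
# Demo: an SVG disguised with a .png extension.
printf '<svg xmlns="http://www.w3.org/2000/svg"></svg>' > "$ASSETS_DIR/disguised.png"
for f in "$ASSETS_DIR"/*; do
  if head -c 256 "$f" | grep -q '<svg'; then
    echo "possible SVG content: $f"
  fi
done
```

Treat hits as candidates for the remediation workflow above, not as definitive matches.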


Metrics-based alert definitions

The following PromQL expressions are recommended as starting-point alert rules. Adjust thresholds to match your traffic volume.

Render failure rate

# Alert if > 10% of PDF renders in the last 5 minutes are failures
rate(pulp_engine_render_requests_total{type="pdf",status="failure"}[5m])
  /
rate(pulp_engine_render_requests_total{type="pdf"}[5m])
  > 0.10

Runbook: Check API logs for reason="render_error" entries. Run the PDF smoke test manually (POST /render) to confirm — if it passes, the failures may be from a specific bad template. If Puppeteer is failing consistently, restart the API to force a browser re-launch.


Auth failure spike (possible credential scanning)

# Alert if invalid-key failures exceed 20 per minute
rate(pulp_engine_auth_failures_total{reason="invalid_key"}[1m]) * 60 > 20

Runbook: Check access logs for the originating IP. If traffic is from an unexpected source, apply a rate limit or IP block at the reverse proxy.


Storage readiness degraded

# Proxy signal: the scrape target's state changed in the last 2 minutes
changes(up{job="pulp-engine"}[2m]) > 0

Note that this expression tracks Prometheus scrape availability, not the readiness endpoint itself. Polling /health/ready directly from your uptime monitor is recommended — simpler and more accurate than a scrape-based alert for storage checks.

Runbook: Check database / file system availability. For postgres: psql "$DATABASE_URL" -c "\conninfo". For file mode: confirm TEMPLATES_DIR is mounted and readable. Once storage recovers, the readiness probe automatically returns 200.


High P99 PDF render latency

# Alert if P99 PDF render latency exceeds 25 seconds
histogram_quantile(0.99,
  rate(pulp_engine_http_request_duration_seconds_bucket{route="render_pdf"}[5m])
) > 25

Runbook: PDF render time depends on template complexity and Puppeteer browser health. Check for hung Chrome processes. If the API is under sustained load (>5 concurrent render requests), queue back-pressure is expected — alert may be a false positive during traffic spikes.


Elevated version conflicts

# Alert if optimistic-concurrency conflicts exceed 5 per minute
rate(pulp_engine_template_mutations_total{status="conflict"}[1m]) * 60 > 5

Runbook: Multiple concurrent editor sessions updating the same template. This is expected at low rates. At elevated rates it may indicate a runaway automation loop or a UI bug. Check which template is causing conflicts via API logs.


Dead-letter queues

Two persistent DLQs exist. Both require admin scope, SCHEDULE_ENABLED=true, and Postgres storage.

/admin/schedule-dlq — failed scheduled deliveries

A schedule execution lands in this queue once every retry for a delivery target has exhausted its backoff.

# List pending entries (paginated, filter by status/scheduleId)
curl -s "http://localhost:3000/admin/schedule-dlq?status=pending&limit=50" \
  -H "X-Api-Key: $API_KEY_ADMIN" | jq .

# Replay — rehydrates the CURRENT schedule config (operator fixes picked up)
curl -X POST "http://localhost:3000/admin/schedule-dlq/<id>/replay" \
  -H "X-Api-Key: $API_KEY_ADMIN"

# Abandon — mark terminal without delivering
curl -X POST "http://localhost:3000/admin/schedule-dlq/<id>/abandon" \
  -H "X-Api-Key: $API_KEY_ADMIN"

Replay can refuse with one of these 409 Conflict codes:

  • schedule_gone — underlying schedule deleted (entry marked orphaned)
  • schedule_mutated — delivery target changed or removed (orphaned)
  • render_artifact_expired — rendered artefact purged; abandon and retrigger via POST /schedules/:id/trigger
  • dispatcher_unavailable — SCHEDULE_ENABLED=false
  • already_terminal — already replayed/abandoned/orphaned

Secrets are never echoed — the DLQ stores references only.
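An operator script can branch on the refusal code; a sketch assuming the 409 body exposes the code in a top-level code field (shape assumed):

```shell
# Hypothetical 409 body from POST /admin/schedule-dlq/<id>/replay.
BODY='{"statusCode":409,"code":"render_artifact_expired"}'
CODE=$(echo "$BODY" | jq -r '.code')
ACTION=$(
  case "$CODE" in
    schedule_gone|schedule_mutated) echo "entry orphaned; nothing to replay" ;;
    render_artifact_expired)        echo "abandon entry, then POST /schedules/:id/trigger" ;;
    dispatcher_unavailable)         echo "enable SCHEDULE_ENABLED and retry" ;;
    already_terminal)               echo "entry already resolved" ;;
    *)                              echo "unexpected code: $CODE" ;;
  esac
)
echo "$ACTION"
```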

/admin/batch-dlq — failed async batch webhooks

Same shape (GET /, GET /:id, POST /:id/replay) for callback webhooks from POST /render/batch/async jobs.


Multi-tenant operations

Routes below require super-admin (scope=admin with tenantId=null); tenant-bound admins are rejected with 403 super_admin_only. The routes are active only when MULTI_TENANT_ENABLED=true and STORAGE_MODE=postgres; otherwise every route returns 503 unavailable.

# Create a tenant (slug is immutable)
curl -X POST http://localhost:3000/admin/tenants \
  -H "X-Api-Key: $API_KEY_SUPER_ADMIN" \
  -H "Content-Type: application/json" \
  -d '{"id":"acme-corp","name":"ACME Corporation"}'

# Soft-archive (blocks writes; reads continue for audit export)
curl -X POST http://localhost:3000/admin/tenants/acme-corp/archive \
  -H "X-Api-Key: $API_KEY_SUPER_ADMIN"

# Restore
curl -X POST http://localhost:3000/admin/tenants/acme-corp/unarchive \
  -H "X-Api-Key: $API_KEY_SUPER_ADMIN"

Archive vs delete. Archive is the supported tenant-offboarding primitive. DELETE /admin/tenants/:id returns 501 Not Implemented until a cascade policy lands for audit events, versions, DLQ history, and scheduled deliveries.

Cache TTL. Archive/unarchive bust the handling pod’s TenantStatusCache immediately; other pods catch up within TENANT_STATUS_CACHE_TTL_MS (default 10 s). Writes to an archived tenant return 409 tenant_archived.


Runtime named-user management

Named-user mode is enabled when either EDITOR_USERS_JSON or EDITOR_USERS_FILE is set. Runtime CRUD via /admin/users mutates the in-memory registry; with EDITOR_USERS_FILE configured, mutations persist to disk via atomic write. Without it, changes are lost on restart.

# List users (key redacted to last 4 chars as keyHint)
curl -s http://localhost:3000/admin/users -H "X-Api-Key: $API_KEY_ADMIN" | jq .

# Add a user
curl -X POST http://localhost:3000/admin/users \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -H "Content-Type: application/json" \
  -d '{"id":"alice","displayName":"Alice","key":"<strong-random>","role":"editor"}'

# Revoke all sessions for one user (effective immediately on next auth check)
curl -X PUT http://localhost:3000/admin/users/alice \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -H "Content-Type: application/json" \
  -d "{\"tokenIssuedAfter\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}"

# Delete (existing tokens remain valid until expiry — no active revocation
# list; set tokenIssuedAfter first if you need immediate cutoff)
curl -X DELETE http://localhost:3000/admin/users/alice \
  -H "X-Api-Key: $API_KEY_ADMIN"

# Re-read EDITOR_USERS_FILE after editing externally; OIDC-auto-provisioned
# users are merged in, not clobbered
curl -X POST http://localhost:3000/admin/users/reload \
  -H "X-Api-Key: $API_KEY_ADMIN"

POST /admin/users/reload returns 404 if EDITOR_USERS_FILE does not exist on disk. When DELETE drops the registry to zero, the response carries X-PulpEngine-Warning — editor login will be unavailable until at least one user exists.


Rollback

Rollback is straightforward when running the Docker image — the image tag is the artifact version. Each released version maps 1-to-1 with a git tag (e.g., v0.PREV.Y).

# 1. Stop and remove the current container
docker stop pulp-engine && docker rm pulp-engine

# 2. Start the previous image tag (no build required)
docker run -d --name pulp-engine \
  [same -p, -e, and -v flags as the original deployment] \
  ghcr.io/OWNER/pulp-engine:v0.PREV.Y

# 3. Validate the rollback
./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN

The previous image is already in the registry — docker run pulls it if it’s not cached locally.

Postgres schema rollback: The previous image’s migrations are already applied — no schema rollback is needed unless you ran forward-only schema changes that break the old code. If that is the case, restore from a database backup taken before the migration ran, then start the rollback image.

Database unavailability (postgres): The API refuses to start if Prisma cannot connect at boot. Restore the PostgreSQL instance and restart the container — no application changes needed.

SQL Server unavailability: Same pattern — restore the SQL Server instance, restart the container.

File mode: If TEMPLATES_DIR is unreachable, the API will fail to start. Verify the volume mount and restart.

Bare-metal rollback (non-Docker deployments)

# 1. Stop the running process
pm2 stop pulp-engine-api
# or: kill -SIGTERM <pid>   (graceful shutdown — waits for in-flight requests)

# 2. Check out the previous release tag and rebuild
git checkout v0.PREV.Y
pnpm install
pnpm db:generate
pnpm build

# 3. Restart
pm2 start pulp-engine-api

# 4. Validate
./scripts/validate-deploy.sh http://localhost:3000 $API_KEY_ADMIN

Upgrading from v0.22.0 to v0.23.0

No breaking changes to the API surface. The upgrade adds two nullable columns to the database schema and two new optional environment variables.

Postgres

A new Prisma migration (add_created_by) is included in v0.23.0. Run as part of the normal upgrade procedure:

pnpm --filter @pulp-engine/api db:deploy

This adds created_by to template_versions and assets. The columns are nullable — safe to apply on a live database with no downtime.

SQL Server

Run the migration runner before starting v0.23.0:

pnpm --filter @pulp-engine/api db:migrate:sqlserver

This applies 002_add_created_by.sql, which adds the same nullable created_by columns. The runner handles both fresh installs and upgrades from v0.22.0 automatically — no manual DDL required.

File mode

No schema changes needed. createdBy is carried in the in-memory asset index; existing records report createdBy: null.

New optional environment variables

Variable → Default → Notes

• ASSET_ACCESS_MODE — default: public. Absent → public mode (unchanged behaviour).
• EDITOR_USERS_JSON — default: (absent). Absent → shared-key mode (unchanged behaviour).

No existing environment variables were renamed or removed.