Pulp Engine — Backup & Restore Runbook
Canonical procedure for operators to back up a Pulp Engine deployment and restore from backup. Pairs with deployment-guide.md (topology + config) and ha-reference-architecture.md (multi-replica deployments).
This runbook assumes a Postgres + object-store topology (the production recommendation). For file-mode evaluation deployments, see § 6.
1. What to back up
Pulp Engine has exactly two durable stores. Everything else (request handlers, editor tokens, capability caches, in-memory schedulers) is derivable from these two.
| Layer | What’s in it | Where it lives |
|---|---|---|
| Postgres | Templates, versions, labels, assets metadata, audit events, schedules + executions + DLQ, tenant registry, render usage | DATABASE_URL — schema in apps/api/src/prisma/schema.prisma |
| Asset binaries | Image files uploaded via the asset library | ASSETS_DIR (filesystem) or the S3 bucket (ASSET_BINARY_STORE=s3) |
Not durable / do not back up:
- .env / secrets — treat these as configuration managed by your platform’s secret manager.
- Local pod disk — rendered PDFs are transient; the Chromium cache is rebuilt on start.
- In-memory state — the tenant status cache reconstructs automatically on restart. Async batch jobs are durable on Postgres deployments (as of v0.72.0): job metadata lives in the batch_jobs Postgres table and completed result envelopes live in IJobResultBlobStore (filesystem at JOB_RESULT_BLOB_DIR or S3 at JOB_RESULT_BLOB_BUCKET). On restart, pending/processing rows older than STARTUP_ORPHAN_GRACE_MS are failed with code job_abandoned_at_startup; remaining active rows rehydrate into the hot cache. Completed jobs remain pollable until WEBHOOK_JOB_RETENTION_SECONDS elapses. The processing-timeout sweep still handles crashes that happen later in processJob(). File-mode and SQL Server deployments retain the pre-T3 in-memory-only behaviour (jobs lost on pod restart). The delivery dispatcher DLQ is Postgres-backed and survives restart.
2. Backup procedure
Authoritative backup path is pg_dump + a sync of the asset store. The CLI (§ 4) adds a lightweight inventory/verification layer on top — it does not replace pg_dump.
2.1 Postgres
# Full, compressed, custom-format dump (restorable via pg_restore).
pg_dump \
--format=custom \
--no-owner \
--no-privileges \
--file=pulp-engine-$(date -u +%Y%m%dT%H%M%SZ).dump \
"$DATABASE_URL"
Recommended cadence:
- Production: continuous archiving (WAL-E / WAL-G / managed-Postgres PITR) + daily full dump for operator-visible artifacts.
- Evaluation / staging: daily dump is sufficient.
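The daily-dump cadence pairs naturally with a retention prune so the backup directory does not grow unbounded. A minimal sketch, assuming dumps land in one directory using the § 2.1 naming scheme; the function name and the keep-14 default are illustrative policy choices, not Pulp Engine defaults:

```shell
# Keep the newest N dumps in a directory, delete the rest.
# N defaults to 14 here — an illustrative policy, not a Pulp Engine default.
prune_dumps() {
  dir="$1"
  keep="${2:-14}"
  # ls -1t sorts newest-first; tail skips the first $keep entries.
  ls -1t "$dir"/pulp-engine-*.dump 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
      echo "pruned $old"
    done
}
```

Because the filenames embed a UTC timestamp, sorting by name and sorting by mtime agree; run the prune after each successful pg_dump.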
2.2 Asset binaries
S3 mode (ASSET_BINARY_STORE=s3):
# Enable S3 bucket versioning once (recommended — gives you PITR for blobs):
aws s3api put-bucket-versioning --bucket "$S3_BUCKET" \
--versioning-configuration Status=Enabled
# Snapshot copy into a dated backup bucket/prefix:
aws s3 sync "s3://$S3_BUCKET" "s3://$BACKUP_BUCKET/assets-$(date -u +%Y%m%dT%H%M%SZ)/"
Filesystem mode (ASSET_BINARY_STORE=filesystem):
tar -czf assets-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C "$(dirname "$ASSETS_DIR")" "$(basename "$ASSETS_DIR")"
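Before shipping the tarball off-host, it is worth confirming the archive is readable end-to-end. A small sketch — the check_archive function name is ours, not part of any Pulp Engine tooling:

```shell
# List the archive's contents; an unreadable/truncated tarball fails the check.
check_archive() {
  if tar -tzf "$1" > /dev/null 2>&1; then
    echo "archive OK: $(tar -tzf "$1" | wc -l) entries"
  else
    echo "archive CORRUPT: $1" >&2
    return 1
  fi
}
```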
2.3 Consistency
Pulp Engine writes the asset binary first and the metadata row second within the same request. A backup taken concurrently with active writes can capture:
- An asset binary with no metadata row (harmless — the binary becomes unreferenced; validated-publish will flag on first reference).
- A metadata row with no binary (render will fail fast with ASSET_BINARY_MISSING — the documented fail-closed behaviour from v0.35.0).
To eliminate the inconsistency window, prefer one of:
- Postgres point-in-time recovery + S3 bucket versioning (belt-and-suspenders — recover both stores to a matching wall-clock moment).
- A brief maintenance window: stop the API pods, run both backups, restart.
For most operators, continuous archiving + versioning is sufficient, and the fail-closed behaviour on render makes the inconsistency window safe to tolerate.
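The two inconsistency classes above can also be enumerated offline with comm(1), given two sorted one-key-per-line listings — one from the binary store (e.g. ls or aws s3 ls) and one from the asset metadata table. Producing those listings is deployment-specific; the function below only does the comparison and is an illustrative sketch:

```shell
# binaries_file: sorted asset keys present in the binary store
# metadata_file: sorted asset keys present in the Postgres metadata
orphan_report() {
  binaries_file="$1"
  metadata_file="$2"
  echo "binaries without metadata (harmless, unreferenced):"
  comm -23 "$binaries_file" "$metadata_file"
  echo "metadata without binaries (renders fail with ASSET_BINARY_MISSING):"
  comm -13 "$binaries_file" "$metadata_file"
}
```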
2.4 Inventory + verification (optional)
After a backup run, use the CLI to capture a manifest that can be verified later:
pulp-engine backup create --out ./backup-$(date -u +%Y%m%dT%H%M%SZ) \
--api-url http://localhost:3000 --api-key $API_KEY_ADMIN
This does not dump the database — it captures counts, checksums, and metadata that a later backup verify run can match against. See § 4.
3. Restore procedure
Restore order matters: restore the object store first (so that metadata rows have backing binaries), then Postgres, then restart the API.
3.1 Stop the API
docker compose -f compose.postgres.yaml stop pulp-engine
3.2 Restore asset binaries
S3 mode:
aws s3 sync "s3://$BACKUP_BUCKET/assets-<timestamp>/" "s3://$S3_BUCKET"
Filesystem mode:
rm -rf "$ASSETS_DIR"
mkdir -p "$ASSETS_DIR"
tar -xzf assets-<timestamp>.tar.gz -C "$(dirname "$ASSETS_DIR")"
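After extracting, GNU tar’s -d (--compare) mode can confirm the on-disk tree matches the archive; a nonzero exit means the restored files diverge (missing entries, size or mtime mismatches). A sketch, assuming GNU tar — the function name is illustrative:

```shell
# Compare archive entries against the extracted tree under $parent.
# Exit status is nonzero if any file is missing or differs.
verify_extract() {
  archive="$1"
  parent="$2"   # the directory you extracted into (dirname of ASSETS_DIR)
  tar -dzf "$archive" -C "$parent"
}
```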
3.3 Restore Postgres
# Drop and recreate to avoid leftover rows from the current state.
# Safer alternative in production: restore into a fresh database and flip the connection.
psql "$DATABASE_URL_ADMIN" -c 'DROP DATABASE "pulp-engine";'
psql "$DATABASE_URL_ADMIN" -c 'CREATE DATABASE "pulp-engine" OWNER "pulp-engine";'
pg_restore \
--dbname="$DATABASE_URL" \
--no-owner --no-privileges \
--clean --if-exists \
pulp-engine-<timestamp>.dump
3.4 Run migrations
Run prisma migrate deploy against the restored database to bring the schema forward to the current app version (no-op if already current):
docker compose -f compose.postgres.yaml run --rm migrate
3.5 Start the API
docker compose -f compose.postgres.yaml start pulp-engine
Schedules resume automatically: the dispatcher reads schedules.next_run_at from the restored rows.
3.6 Verify
# 1. Structural — manifest integrity check against the live API
pulp-engine backup verify --in ./backup-<timestamp> \
--api-url http://localhost:3000 --api-key $API_KEY_ADMIN
# 2. Functional — render a known-good template, byte-compare against a reference PDF
curl -X POST http://localhost:3000/render \
-H "x-api-key: $API_KEY_RENDER" \
-H "content-type: application/json" \
-d '{"templateKey":"known-good","data":{...}}' \
-o rendered.pdf
Important: backup verify confirms the backup artifact is internally consistent (manifest counts + checksums still match the live API). It does not prove that the restore succeeded — that is proven by the sample render in step 2 above.
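The byte-compare in step 2 can be scripted with cmp(1). Here reference.pdf stands in for an operator-maintained known-good artifact, and the function name is illustrative:

```shell
# Strict byte-equality check between the fresh render and the reference PDF.
verify_render() {
  rendered="$1"
  reference="$2"
  if cmp -s "$rendered" "$reference"; then
    echo "PASS: rendered output is byte-identical to reference"
  else
    echo "FAIL: rendered output differs from reference" >&2
    return 1
  fi
}
# Example: verify_render rendered.pdf reference.pdf
```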
4. CLI tooling — pulp-engine backup
Two subcommands are shipped today. A third, pulp-engine backup restore, is a tracked follow-up and intentionally out of scope for this release — see § 5.
pulp-engine backup create
Snapshots inventory and asset binaries into a backup directory. Writes:
- manifest.json — schema version, API version, counts, checksums, timestamp
- assets.tar.gz — asset binaries (filesystem mode) or an S3-inventory file (S3 mode)
Postgres is not dumped by this command. Operators run pg_dump separately and the path to the resulting file may be recorded in the manifest via --db-dump <path>.
pulp-engine backup create --out ./backup-20260413 \
--api-url http://localhost:3000 \
--api-key $API_KEY_ADMIN \
[--db-dump ./pulp-engine-20260413.dump]
pulp-engine backup verify
Reads manifest.json and checks it against a live API:
- Template/asset counts still match (within an optional --tolerance window).
- Checksums of the backed-up asset binaries still match the current asset store.
Exit code 0 on match, non-zero on drift. Use as the last step of a backup run to confirm the artifact is internally consistent.
pulp-engine backup verify --in ./backup-20260413 \
--api-url http://localhost:3000 \
--api-key $API_KEY_ADMIN
5. What this release does NOT include
The following are intentionally out of scope and tracked for a future release:
- pulp-engine backup restore write-path command — replaying templates/versions/assets from a backup directory into a live API. Requires admin import/export endpoints with authentication, validation, tenant-scoping, and conflict-resolution semantics that haven’t been designed yet. Restore today is manual (§ 3).
- Automated backup scheduler inside the API — Pulp Engine does not manage its own backups; run pg_dump + asset sync from your platform’s standard backup tooling.
- Cross-region replication — single-region residency by design. See data-residency-gdpr.md for the multi-region pattern (separate deployments per region).
6. File-mode evaluation deployments
If you are running STORAGE_MODE=file (evaluation / single-instance), the backup is even simpler:
tar -czf pulp-engine-file-backup-$(date -u +%Y%m%dT%H%M%SZ).tar.gz \
-C "$(dirname "$TEMPLATES_DIR")" "$(basename "$TEMPLATES_DIR")"
TEMPLATES_DIR contains templates, versions, audit events (.audit-events.jsonl), and asset metadata. Restore: stop API, extract tarball over the directory, start API. File mode is not HA-safe and is not recommended for production.