Pulp Engine — Backup & Restore Runbook
Canonical procedure for operators to back up a Pulp Engine deployment and restore from backup. Pairs with deployment-guide.md (topology + config) and ha-reference-architecture.md (multi-replica deployments).
This runbook assumes a Postgres + object-store topology (the production recommendation). For file-mode evaluation deployments, see § 6.
1. What to back up
Pulp Engine has exactly two durable stores. Everything else (request handlers, editor tokens, capability caches, in-memory schedulers) is derivable from these two.
| Layer | What’s in it | Where it lives |
|---|---|---|
| Postgres | Templates, versions, labels, assets metadata, audit events, schedules + executions + DLQ, tenant registry, render usage | DATABASE_URL — schema in apps/api/src/prisma/schema.prisma |
| Asset binaries | Image files uploaded via the asset library | ASSETS_DIR (filesystem) or the S3 bucket (ASSET_BINARY_STORE=s3) |
Not durable / do not back up:
- .env / secrets — treat these as configuration managed by your platform’s secret manager.
- Local pod disk — rendered PDFs are transient; the Chromium cache is rebuilt on start.
- In-memory state — the tenant status cache reconstructs automatically on restart. Async batch jobs are durable on Postgres deployments (as of v0.72.0): job metadata lives in the batch_jobs Postgres table and completed result envelopes live in IJobResultBlobStore (filesystem at JOB_RESULT_BLOB_DIR or S3 at JOB_RESULT_BLOB_BUCKET). On restart, pending/processing rows older than STARTUP_ORPHAN_GRACE_MS are failed with code job_abandoned_at_startup; remaining active rows rehydrate into the hot cache. Completed jobs remain pollable until WEBHOOK_JOB_RETENTION_SECONDS elapses. The processing-timeout sweep still handles crashes that happen later in processJob(). File-mode and SQL Server deployments retain the pre-T3 in-memory-only behaviour (jobs lost on pod restart). The delivery dispatcher DLQ is Postgres-backed and survives restart.
2. Backup procedure
Authoritative backup path is pg_dump + a sync of the asset store. The CLI (§ 4) adds a lightweight inventory/verification layer on top — it does not replace pg_dump.
2.1 Postgres
# Full, compressed, custom-format dump (restorable via pg_restore).
pg_dump \
--format=custom \
--no-owner \
--no-privileges \
--file=pulp-engine-$(date -u +%Y%m%dT%H%M%SZ).dump \
"$DATABASE_URL"
Recommended cadence:
- Production: continuous archiving (WAL-E / WAL-G / managed-Postgres PITR) + daily full dump for operator-visible artifacts.
- Evaluation / staging: daily dump is sufficient.
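The daily-dump cadence pairs naturally with a retention prune so the backup directory does not grow unbounded. A minimal sketch, assuming dumps land in one directory using the § 2.1 naming scheme; the function name and the keep-14 default are illustrative policy choices, not Pulp Engine defaults:

```shell
# Keep the newest N dumps in a directory, delete the rest.
# N defaults to 14 here — an illustrative policy, not a Pulp Engine default.
prune_dumps() {
  dir="$1"
  keep="${2:-14}"
  # ls -1t sorts newest-first; tail skips the first $keep entries.
  ls -1t "$dir"/pulp-engine-*.dump 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
      echo "pruned $old"
    done
}
```

Because the filenames embed a UTC timestamp, sorting by name and sorting by mtime agree; run the prune after each successful pg_dump.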
2.2 Asset binaries
S3 mode (ASSET_BINARY_STORE=s3):
# Enable S3 bucket versioning once (recommended — gives you PITR for blobs):
aws s3api put-bucket-versioning --bucket "$S3_BUCKET" \
--versioning-configuration Status=Enabled
# Snapshot copy into a dated backup bucket/prefix:
aws s3 sync "s3://$S3_BUCKET" "s3://$BACKUP_BUCKET/assets-$(date -u +%Y%m%dT%H%M%SZ)/"
Filesystem mode (ASSET_BINARY_STORE=filesystem):
tar -czf assets-$(date -u +%Y%m%dT%H%M%SZ).tar.gz -C "$(dirname "$ASSETS_DIR")" "$(basename "$ASSETS_DIR")"
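Before shipping the tarball off-host, it is worth confirming the archive is readable end-to-end. A small sketch — the check_archive function name is ours, not part of any Pulp Engine tooling:

```shell
# List the archive's contents; an unreadable/truncated tarball fails the check.
check_archive() {
  if tar -tzf "$1" > /dev/null 2>&1; then
    echo "archive OK: $(tar -tzf "$1" | wc -l) entries"
  else
    echo "archive CORRUPT: $1" >&2
    return 1
  fi
}
```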
2.3 Consistency
Pulp Engine writes the asset binary first and the metadata row second within the same request. A backup taken concurrently with active writes can capture:
- An asset binary with no metadata row (harmless — the binary becomes unreferenced; validated-publish will flag on first reference).
- A metadata row with no binary (render will fail fast with ASSET_BINARY_MISSING — the documented fail-closed behaviour from v0.35.0).
To eliminate the inconsistency window, prefer one of:
- Postgres point-in-time recovery + S3 bucket versioning (belt-and-suspenders — recover both stores to a matching wall-clock moment).
- A brief maintenance window: stop the API pods, run both backups, restart.
For most operators, continuous archiving + versioning is sufficient, and the fail-closed behaviour on render makes the inconsistency window safe to tolerate.
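The two inconsistency classes above can also be enumerated offline with comm(1), given two sorted one-key-per-line listings — one from the binary store (e.g. ls or aws s3 ls) and one from the asset metadata table. Producing those listings is deployment-specific; the function below only does the comparison and is an illustrative sketch:

```shell
# binaries_file: sorted asset keys present in the binary store
# metadata_file: sorted asset keys present in the Postgres metadata
orphan_report() {
  binaries_file="$1"
  metadata_file="$2"
  echo "binaries without metadata (harmless, unreferenced):"
  comm -23 "$binaries_file" "$metadata_file"
  echo "metadata without binaries (renders fail with ASSET_BINARY_MISSING):"
  comm -13 "$binaries_file" "$metadata_file"
}
```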
2.4 Inventory + verification (optional)
After a backup run, use the CLI to capture a manifest that can be verified later:
pulp-engine backup create --out ./backup-$(date -u +%Y%m%dT%H%M%SZ) \
--api-url http://localhost:3000 --api-key $API_KEY_ADMIN
This does not dump the database — it captures counts, checksums, and metadata that a later backup verify run can match against. See § 4.
3. Restore procedure
Restore order matters: restore the object store first (so that metadata rows have backing binaries), then Postgres, then restart the API.
3.1 Stop the API
docker compose -f compose.postgres.yaml stop pulp-engine
3.2 Restore asset binaries
S3 mode:
aws s3 sync "s3://$BACKUP_BUCKET/assets-<timestamp>/" "s3://$S3_BUCKET"
Filesystem mode:
rm -rf "$ASSETS_DIR"
mkdir -p "$ASSETS_DIR"
tar -xzf assets-<timestamp>.tar.gz -C "$(dirname "$ASSETS_DIR")"
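After extracting, GNU tar’s -d (--compare) mode can confirm the on-disk tree matches the archive; a nonzero exit means the restored files diverge (missing entries, size or mtime mismatches). A sketch, assuming GNU tar — the function name is illustrative:

```shell
# Compare archive entries against the extracted tree under $parent.
# Exit status is nonzero if any file is missing or differs.
verify_extract() {
  archive="$1"
  parent="$2"   # the directory you extracted into (dirname of ASSETS_DIR)
  tar -dzf "$archive" -C "$parent"
}
```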
3.3 Restore Postgres
# Drop and recreate to avoid leftover rows from the current state.
# Safer alternative in production: restore into a fresh database and flip the connection.
psql "$DATABASE_URL_ADMIN" -c 'DROP DATABASE "pulp-engine";'
psql "$DATABASE_URL_ADMIN" -c 'CREATE DATABASE "pulp-engine" OWNER "pulp-engine";'
pg_restore \
--dbname="$DATABASE_URL" \
--no-owner --no-privileges \
--clean --if-exists \
pulp-engine-<timestamp>.dump
3.4 Run migrations
Run prisma migrate deploy against the restored database to bring the schema forward to the current app version (no-op if already current):
docker compose -f compose.postgres.yaml run --rm migrate
3.5 Start the API
docker compose -f compose.postgres.yaml start pulp-engine
Schedules resume automatically: the dispatcher reads schedules.next_run_at from the restored rows.
3.6 Verify
# 1. Structural — manifest integrity check against the live API
pulp-engine backup verify --in ./backup-<timestamp> \
--api-url http://localhost:3000 --api-key $API_KEY_ADMIN
# 2. Functional — render a known-good template, byte-compare against a reference PDF
curl -X POST http://localhost:3000/render \
-H "x-api-key: $API_KEY_RENDER" \
-H "content-type: application/json" \
-d '{"templateKey":"known-good","data":{...}}' \
-o rendered.pdf
Important: backup verify confirms the backup artifact is internally consistent (manifest counts + checksums still match the live API). It does not prove that the restore succeeded — that is proven by the sample render in step 2 above.
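The byte-compare in step 2 can be scripted with cmp(1). Here reference.pdf stands in for an operator-maintained known-good artifact, and the function name is illustrative:

```shell
# Strict byte-equality check between the fresh render and the reference PDF.
verify_render() {
  rendered="$1"
  reference="$2"
  if cmp -s "$rendered" "$reference"; then
    echo "PASS: rendered output is byte-identical to reference"
  else
    echo "FAIL: rendered output differs from reference" >&2
    return 1
  fi
}
# Example: verify_render rendered.pdf reference.pdf
```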
4. CLI tooling — pulp-engine backup
Two subcommands are shipped today. A third, pulp-engine backup restore, is a tracked follow-up and intentionally out of scope for this release — see § 5.
pulp-engine backup create
Snapshots inventory and asset binaries into a backup directory. Writes:
- manifest.json — schema version, API version, counts, checksums, timestamp
- assets.tar.gz — asset binaries (filesystem mode) or an S3-inventory file (S3 mode)
Postgres is not dumped by this command. Operators run pg_dump separately and the path to the resulting file may be recorded in the manifest via --db-dump <path>.
pulp-engine backup create --out ./backup-20260413 \
--api-url http://localhost:3000 \
--api-key $API_KEY_ADMIN \
[--db-dump ./pulp-engine-20260413.dump]
pulp-engine backup verify
Reads manifest.json and checks it against a live API:
- Template/asset counts still match (within an optional --tolerance window).
- Checksums of the backed-up asset binaries still match the current asset store.
Exit code 0 on match, non-zero on drift. Use as the last step of a backup run to confirm the artifact is internally consistent.
pulp-engine backup verify --in ./backup-20260413 \
--api-url http://localhost:3000 \
--api-key $API_KEY_ADMIN
5. What this release does NOT include
The following are intentionally out of scope and tracked for a future release:
- pulp-engine backup restore write-path command — replaying templates/versions/assets from a backup directory into a live API. Requires admin import/export endpoints with authentication, validation, tenant-scoping, and conflict-resolution semantics that haven’t been designed yet. Restore today is manual (§ 3).
- Automated backup scheduler inside the API — Pulp Engine does not manage its own backups; run pg_dump + asset sync from your platform’s standard backup tooling.
- Cross-region replication — single-region residency by design. See data-residency-gdpr.md for the multi-region pattern (separate deployments per region).
6. File-mode evaluation deployments
If you are running STORAGE_MODE=file (evaluation / single-instance), the backup is even simpler:
tar -czf pulp-engine-file-backup-$(date -u +%Y%m%dT%H%M%SZ).tar.gz \
-C "$(dirname "$TEMPLATES_DIR")" "$(basename "$TEMPLATES_DIR")"
TEMPLATES_DIR contains templates, versions, audit events (.audit-events.jsonl), and asset metadata. Restore: stop API, extract tarball over the directory, start API. File mode is not HA-safe and is not recommended for production.