HA Validation Report
Type: Manual smoke test, not an automated regression gate. Rerun after major version upgrades or infrastructure changes to this stack.
Stack: docker-compose.ha.yml — 2 API replicas + Postgres + MinIO + nginx LB. Demo/evaluation stack; not production-representative.
Checklist: ha-reference-architecture.md § 7.
Run log
Each run below records stack versions, operator, date, and pass/fail per check.
Template entry
Date: YYYY-MM-DD
Operator: <name>
Pulp Engine version: vX.Y.Z (APP_VERSION)
Postgres: 16-alpine
MinIO: <tag>
nginx: 1.27-alpine
Host OS: <os/arch>
Checks:
[ ] 1. Shared asset readability (upload via api1, render via api2) — PASS/FAIL
[ ] 2. Schedule fires exactly once across replicas — PASS/FAIL
[ ] 3. Editor token minted on api1 verifies on api2 — PASS/FAIL
[ ] 4. Graceful degradation when one replica is killed — PASS/FAIL
[ ] 5. Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS — PASS/FAIL
[ ] 6. API key rotation with *_PREVIOUS, zero 401s — PASS/FAIL
Notes / anomalies:
<free text>
Runs
Run 1 — 2026-04-21 (3/6 PASS, 2/6 deferred, 1/6 N/A)
Date: 2026-04-21
Operator: Troy Krajancic (with Claude Opus 4.7)
Pulp Engine version: v0.73.0 (commits on credibility-groundwork; not yet tagged)
Image: local build (GHCR publish blocked — see Notes)
Postgres: 16-alpine
MinIO: latest (RELEASE.2025-09-07T16-13-09Z)
nginx: 1.27-alpine
Host OS: Windows 11 Pro 10.0.26200, Docker Desktop 29.4.0
LB host port: 3010 (override; 3000 was held by another container)
Checks:
[✓] 1. Shared asset readability (upload via LB, fetch from both replicas) — PASS
[-] 2. Schedule fires exactly once across replicas — DEFERRED
[✓] 3. Editor token minted on api1 verifies on api2 — PASS
[✓] 4. Graceful degradation when one replica is killed — PASS
[N/A] 5. Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS — N/A (single-tenant stack)
[-] 6. API key rotation with *_PREVIOUS, zero 401s — DEFERRED
Detail per check:
[1] Uploaded a 535,381-byte PNG via the LB. Fetched the same filename
from api1 (wget direct → 535,381 bytes) AND api2 (wget direct →
535,381 bytes). Confirms that both replicas read the shared MinIO
objects and Postgres asset metadata consistently.
[3] Minted an editor token via api1 (POST /auth/editor-token returned
a 4-part HMAC token + tenantId=default). Same token verified on
api2 (GET /templates returned 200 + empty list). HMAC secret is
shared via EDITOR_TOKEN_SECRET and the token is replica-portable.
[4] Killed pulpengine-api1-1; sent 5 GET /health requests through the
LB; all 5 returned 200 (served by api2). Restarted api1; stack
returned to 2-replica steady state.
Compose-file fixes required to bring up the stack (committed alongside
this run; otherwise the stack crashloops):
- ASSET_ACCESS_MODE: private
(without it, S3_PUBLIC_URL is required when S3_ENDPOINT is set)
- HARDEN_PRODUCTION: "false"
(the demo stack does not satisfy production hardening gates by
design — CORS, METRICS_TOKEN, REQUIRE_HTTPS, BLOCK_REMOTE_RESOURCES,
named-user registry. Real production keeps these on.)
- S3_PATH_STYLE: "true" instead of S3_FORCE_PATH_STYLE
(env var renamed during storage hardening; compose was never
updated → MinIO bucket lookup blew up with ENOTFOUND on the
virtual-host-style hostname.)
Source-code fix required for asset streaming (committed alongside this run):
- apps/api/src/storage/asset-binary/s3-asset-binary.store.ts:
stream() called response.Body.transformToNodeStream() which was
removed in @aws-sdk/client-s3 3.700+. Replaced with a runtime
check (Readable | web stream → Readable.fromWeb).
Deferred (recorded, not yet re-run; both now have reproducible
drivers ready for Run 2 — see "How to drive the checks" below):
[2] Schedule-fires-once — reproducible via
`scripts/ha/check-2-schedule-fires-once.mjs`
(`pnpm ha:check-2`). API-only; no DB client required; cleans
up its own template + schedule.
[6] API key rotation — reproducible via the complete playbook in
the "How to drive the checks > Check 6" section below +
[`docker-compose.ha.override.yml.example`](https://github.com/TroyCoderBoy/pulpengine/blob/main/docker-compose.ha.override.yml.example).
N/A:
[5] Tenant archive propagation only applies when MULTI_TENANT_ENABLED=true.
The HA compose stack runs single-tenant by design (the demo path);
the multi-tenant variant has its own validation surface tracked
separately.
Notes / anomalies:
- GHCR publish blocked: the next tagged release should restore the
publish step in release.yml (billing block per
.claude/.../project_actions_billing_block.md). Until then, this
procedure documents how to substitute a local build:
docker build -t ghcr.io/troycoderboy/pulp-engine:latest .
docker compose -f docker-compose.ha.yml up -d
- Port-3000 conflict: another container (open-webui) had the host
port. Used a one-line compose override:
services:
  lb:
    ports: !override
      - "3010:80"
Follow-up: ship this override as docker-compose.ha.override.yml, or document the workaround.
How to drive the checks
The repo does not yet ship a full automated HA harness. Check 2 has a scripted driver; every other check is executed manually against the running stack. Minimal commands:
Check 1 — shared asset readability
# Upload via api1 (force the hit by talking to the LB with a sticky cookie, or
# directly to the container port if exposed)
curl -X POST http://localhost:3000/assets \
-H "x-api-key: $API_KEY_ADMIN" -F file=@logo.png
# Note the asset id. Render a template that references it, targeting api2 the
# same way. Expect a PDF that embeds the uploaded bytes.
Check 2 — schedule fires exactly once. Run the reproducible harness at scripts/ha/check-2-schedule-fires-once.mjs — API-only, no DB access required. It seeds a temp template + per-minute schedule, observes ~200s (3 ticks), asserts each fireTime appears exactly once, and tears its own fixtures down.
bash / WSL / macOS / Linux:
LB_BASE_URL=http://localhost:3010 API_KEY_ADMIN=<key> pnpm ha:check-2
# Expect: PASS — 3 ticks, 1 fire each.
PowerShell (Windows):
$env:LB_BASE_URL = 'http://localhost:3010'
$env:API_KEY_ADMIN = '<key>'
pnpm ha:check-2
If you prefer direct SQL for spot-checking a live stack (not required for Run 2):
# Reads the underlying fire_time column the harness inspects via the API.
psql $DATABASE_URL -c "SELECT schedule_id, fire_time, count(*) FROM schedule_executions
WHERE fire_time > now() - interval '5 minutes' GROUP BY schedule_id, fire_time;"
# Expect exactly one row per (schedule_id, fire_time), regardless of replica count.
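The exactly-once assertion amounts to grouping executions by schedule and fire time and flagging any group with more than one row. The sketch below illustrates that grouping only; it is not the code in `scripts/ha/check-2-schedule-fires-once.mjs`, and the `findDuplicateFires` name and the `{ scheduleId, fireTime }` row shape are assumptions made for the example.

```javascript
// Hypothetical illustration of the exactly-once check; the row shape
// { scheduleId, fireTime } is an assumption, not the real API response.
function findDuplicateFires(executions) {
  const counts = new Map();
  for (const { scheduleId, fireTime } of executions) {
    const key = `${scheduleId}|${fireTime}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Any (schedule, fire time) pair seen more than once means two
  // replicas both fired the same tick.
  return [...counts]
    .filter(([, count]) => count > 1)
    .map(([key, count]) => ({ key, count }));
}
```

An empty result corresponds to a PASS; each returned entry names a tick that fired more than once and is worth attaching to the run notes.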
Check 3 — editor token cross-replica
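# Mint via api1 (POST /auth/editor-token, the same endpoint used in Run 1 and Check 6)
TOKEN=$(curl -s -X POST http://localhost:3000/auth/editor-token \
-H "x-api-key: $API_KEY_EDITOR" -H "content-type: application/json" \
-d '{"actor":"ha-check-3"}' | jq -r .token)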
# Hit a mutation; LB may route to api2. Expect 200 + audit row credited to the minter.
curl -X PUT http://localhost:3000/templates/demo \
-H "authorization: Bearer $TOKEN" -H "content-type: application/json" \
-d '{"definition":{...}}'
Check 4 — graceful degradation
docker compose -f docker-compose.ha.yml kill api1
# Retry traffic — expect 200s from api2.
docker compose -f docker-compose.ha.yml start api1
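The "retry traffic" step can be scripted rather than eyeballed. The following is a minimal sketch, not something the repo ships: `probeHealth` and `summarizeProbes` are hypothetical names, the code assumes Node 18+ for the global `fetch`, and it reuses the `GET /health` endpoint that Run 1 exercised through the LB.

```javascript
// Hypothetical helper, not shipped in the repo. Assumes Node 18+ (global fetch).

// Reduce a list of HTTP status codes to a pass/fail summary.
function summarizeProbes(statuses) {
  const failed = statuses.filter((status) => status !== 200).length;
  return {
    total: statuses.length,
    failed,
    pass: statuses.length > 0 && failed === 0,
  };
}

// Send `count` sequential GET /health requests through the LB.
// A network error is recorded as status 0 so it counts as a failure.
async function probeHealth(baseUrl, count = 5) {
  const statuses = [];
  for (let i = 0; i < count; i += 1) {
    try {
      const res = await fetch(new URL('/health', baseUrl));
      statuses.push(res.status);
    } catch {
      statuses.push(0);
    }
  }
  return summarizeProbes(statuses);
}
```

With api1 killed, calling `probeHealth` with the LB base URL (for example `http://localhost:3010` if the Run 1 port override is in place) should resolve to a summary with `pass: true` when api2 is serving every request.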
Check 5 — tenant archive propagation (multi-tenant mode only)
# Archive a tenant via api1. Immediately attempt a write via api2; it may still
# succeed for up to TENANT_STATUS_CACHE_TTL_MS. Retry after TTL; expect rejection.
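The pass criterion here depends on timing: a write accepted shortly after the archive is tolerated, while one accepted after the TTL is a failure. A small sketch of that decision, assuming timestamps in epoch milliseconds; `classifyWrite` and its field names are hypothetical and not part of the codebase.

```javascript
// Hypothetical classifier for Check 5 observations; not part of the repo.
// archivedAtMs / writeAtMs are epoch milliseconds, ttlMs mirrors
// TENANT_STATUS_CACHE_TTL_MS, accepted is whether the write succeeded.
function classifyWrite({ archivedAtMs, writeAtMs, ttlMs, accepted }) {
  if (!accepted) {
    // Rejection is always acceptable: the replica already sees the archive.
    return 'ok-rejected';
  }
  const elapsedMs = writeAtMs - archivedAtMs;
  // An accepted write is tolerated only while the status cache may be stale.
  return elapsedMs <= ttlMs ? 'ok-stale-cache' : 'fail-accepted-after-ttl';
}
```

In practice a small grace margin on top of the TTL may be reasonable to absorb clock skew and request latency; that margin is left out of the sketch.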
Check 6 — API key rotation (complete playbook).
The rotation contract (docs/api-guide.md:249, docs/deployment-guide.md:83,
memory project_auth_rotation_v1.md): API_KEY_*_PREVIOUS is
verify-only for existing editor session tokens. It cannot mint new
tokens and cannot be used as X-Api-Key. Direct X-Api-Key callers
using the OLD key get 401 post-cutover and must switch.
1. Baseline capture. Set `OLD_ADMIN=<current API_KEY_ADMIN>` and generate a new random key `NEW_ADMIN`.

2. Pre-cutover: mint an editor token with the OLD admin key. This is the token we’ll prove still verifies via `*_PREVIOUS` during the rollover window.

       OLD_TOKEN=$(curl -s -X POST http://localhost:3010/auth/editor-token \
         -H "x-api-key: $OLD_ADMIN" -H "content-type: application/json" \
         -d '{"actor":"rotation-test"}' | jq -r .token)

   PowerShell:

       $resp = Invoke-RestMethod -Method Post -Uri http://localhost:3010/auth/editor-token `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
         -Body '{"actor":"rotation-test"}'
       $env:OLD_TOKEN = $resp.token

3. Pre-cutover sanity: OLD key via `X-Api-Key` returns 200.

       curl -o /dev/null -s -w "%{http_code}\n" \
         -H "x-api-key: $OLD_ADMIN" http://localhost:3010/templates
       # 200

4. Record cutover timestamp for the restart-window log check:

       T0=$(date +%s)

5. Cutover. Copy `docker-compose.ha.override.yml.example` to `docker-compose.ha.override.yml` and uncomment the “API key rotation cutover” section. Set shell env vars `NEW_ADMIN` and `OLD_ADMIN`, then:

       docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
         up -d --no-deps api1 api2

   Wait for both `/health/ready` to return 200.

6. Log evidence, scoped to the restart window, BEFORE running the deliberate negative probes. This zero-count applies only to background traffic during the restart (schedulers, internal probes, in-flight client calls). The deliberate 401 probes in step 7 come AFTER this check so they don’t get folded in:

       SINCE=$(( $(date +%s) - T0 ))
       docker compose logs api1 api2 --since="${SINCE}s" \
         | grep -Ec '"msg":"Auth failure"|"statusCode":401'
       # Expected: 0. Any non-zero value is a real regression; capture
       # the offending lines for the Run 2 report.

7. Post-cutover deliberate probes. Four explicit calls that prove the rotation contract:

   | Call | Expected | Why |
   | --- | --- | --- |
   | `curl -H "x-api-key: $NEW_ADMIN" /templates` | 200 | New key is active. |
   | `curl -H "x-editor-token: $OLD_TOKEN" /templates` | 200 | Old token verifies via `*_PREVIOUS` within rollover. |
   | `curl -H "x-api-key: $OLD_ADMIN" /templates` | 401 | Previous keys cannot be used as `X-Api-Key`. |
   | `curl -X POST -H "x-api-key: $OLD_ADMIN" /auth/editor-token` | 401 | Previous keys cannot mint. |

   PowerShell probes (same four, same expected codes):

       $r1 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-api-key' = $env:NEW_ADMIN } -SkipHttpErrorCheck
       $r2 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-editor-token' = $env:OLD_TOKEN } -SkipHttpErrorCheck
       $r3 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -SkipHttpErrorCheck
       $r4 = Invoke-WebRequest -Method Post -Uri http://localhost:3010/auth/editor-token `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
         -Body '{"actor":"rotation-test"}' -SkipHttpErrorCheck
       $r1.StatusCode, $r2.StatusCode, $r3.StatusCode, $r4.StatusCode
       # 200 200 401 401

8. Post-rollover window cleanup. Once `EDITOR_TOKEN_TTL_MINUTES` (default 480 min / 8h) has elapsed since cutover, remove `API_KEY_ADMIN_PREVIOUS` from the override, restart replicas, and confirm:
   - `curl -H "x-editor-token: $OLD_TOKEN" /templates` → 401 (old token no longer verifies; previous key is gone).
   - Minting a fresh editor token with `NEW_ADMIN` still works.
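The evidence this playbook asks for (a zero auth-failure count in the restart window, the 200/200/401/401 probe matrix, and the point at which the previous key can be dropped) can be checked mechanically once the raw values are captured. The helpers below are a sketch under assumptions: all function names are hypothetical, the regex simply mirrors the grep pattern from the log-evidence step, and the 480-minute default is the documented `EDITOR_TOKEN_TTL_MINUTES` default.

```javascript
// Hypothetical helpers for recording Check 6 evidence; not shipped in the repo.

// Same alternation as the restart-window grep; no global flag, so test() is stateless.
const AUTH_FAILURE = /"msg":"Auth failure"|"statusCode":401/;

// Count matching lines, like `grep -Ec` (a line matching twice counts once).
function countAuthFailureLines(logText) {
  return logText.split('\n').filter((line) => AUTH_FAILURE.test(line)).length;
}

// Expected status codes for the four deliberate probes, in playbook order.
const EXPECTED_PROBES = [
  { name: 'new key as X-Api-Key', expected: 200 },
  { name: 'old editor token', expected: 200 },
  { name: 'old key as X-Api-Key', expected: 401 },
  { name: 'old key minting a token', expected: 401 },
];

// Return the probes whose observed status differs from the rotation contract.
function probeMismatches(observed) {
  return EXPECTED_PROBES
    .map((probe, i) => ({ ...probe, observed: observed[i] }))
    .filter((probe) => probe.observed !== probe.expected);
}

// Earliest epoch second at which the *_PREVIOUS key can be removed:
// cutover time plus the editor token TTL.
function earliestCleanupEpoch(cutoverEpochSec, ttlMinutes = 480) {
  return cutoverEpochSec + ttlMinutes * 60;
}
```

A run passes this part when `countAuthFailureLines` returns 0 for the restart-window logs and `probeMismatches` returns an empty array; `earliestCleanupEpoch` fed with the recorded `T0` gives the earliest time for the cleanup step.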
Follow-up
A full HA harness (compose up → run all 6 checks → tear down, with pass/fail assertions) remains a separate initiative. The Check-2 script at scripts/ha/check-2-schedule-fires-once.mjs is the first building block — API-only, self-cleaning, reusable under any deployment that exposes the admin API. Check 6 stays manual because the rotation contract requires an env-swap + restart cycle that doesn’t script cleanly without touching the compose file. Checks 1/3/4 passed in Run 1 and are only rerun if the underlying code paths change.
Run 2 template
Fill in and append to the Runs section when executing the two deferred checks against a live stack.
#### Run 2 — YYYY-MM-DD (targeted rerun for the 2 deferred checks)
Date: YYYY-MM-DD
Operator: <name> (with Claude Opus 4.7)
Pulp Engine version: v0.74.0 (APP_VERSION) — local build (GHCR publish
pending — see docs/runbooks/ghcr-republish.md)
Stack deltas from Run 1: none; compose-env fixes from Run 1 are live.
Checks (only the two previously deferred are in scope):
[✓/✗] 2. Schedule fires exactly once across replicas — driven by
pnpm ha:check-2 (scripts/ha/check-2-schedule-fires-once.mjs).
Attach the PASS line + the raw executions rows the harness
printed.
[✓/✗] 6. API key rotation with *_PREVIOUS — driven by the playbook
in "How to drive the checks > Check 6" above +
docker-compose.ha.override.yml.example.
Attach:
- The restart-window log count (step 6, expected 0).
- The 4-row 200/200/401/401 probe matrix (step 7).
Notes / anomalies:
<free text>