HA Validation Report

Type: Manual smoke test, not an automated regression gate. Rerun after major version upgrades or infrastructure changes to this stack.

Stack: docker-compose.ha.yml — 2 API replicas + Postgres + MinIO + nginx LB. Demo/evaluation stack; not production-representative.

Checklist: ha-reference-architecture.md § 7.

Run log

Each run below records stack versions, operator, date, and pass/fail per check.

Template entry

Date:             YYYY-MM-DD
Operator:         <name>
Pulp Engine version: vX.Y.Z (APP_VERSION)
Postgres:         16-alpine
MinIO:            <tag>
nginx:            1.27-alpine
Host OS:          <os/arch>

Checks:
  [ ] 1. Shared asset readability (upload via api1, render via api2) — PASS/FAIL
  [ ] 2. Schedule fires exactly once across replicas                 — PASS/FAIL
  [ ] 3. Editor token minted on api1 verifies on api2                — PASS/FAIL
  [ ] 4. Graceful degradation when one replica is killed             — PASS/FAIL
  [ ] 5. Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS — PASS/FAIL
  [ ] 6. API key rotation with *_PREVIOUS, zero 401s                 — PASS/FAIL

Notes / anomalies:
  <free text>

Runs

Run 1 — 2026-04-21 (3/6 PASS, 2/6 deferred, 1/6 N/A)

Date:             2026-04-21
Operator:         Troy Krajancic (with Claude Opus 4.7)
Pulp Engine version: v0.73.0 (commits on credibility-groundwork; not yet tagged)
Image:            local build (GHCR publish blocked — see Notes)
Postgres:         16-alpine
MinIO:            latest (RELEASE.2025-09-07T16-13-09Z)
nginx:            1.27-alpine
Host OS:          Windows 11 Pro 10.0.26200, Docker Desktop 29.4.0
LB host port:     3010 (override; 3000 was held by another container)

Checks:
  [✓] 1. Shared asset readability (upload via LB, fetch from both replicas) — PASS
  [-] 2. Schedule fires exactly once across replicas                       — DEFERRED
  [✓] 3. Editor token minted on api1 verifies on api2                      — PASS
  [✓] 4. Graceful degradation when one replica is killed                   — PASS
  [N/A] 5. Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS    — N/A (single-tenant stack)
  [-] 6. API key rotation with *_PREVIOUS, zero 401s                       — DEFERRED

Detail per check:
  [1] Uploaded a 535,381-byte PNG via the LB. Fetched the same filename
      from api1 (wget direct → 535,381 bytes) AND api2 (wget direct →
      535,381 bytes). Confirms that both replicas read the shared MinIO
      objects and Postgres asset metadata consistently.
  [3] Minted an editor token via api1 (POST /auth/editor-token returned
      a 4-part HMAC token + tenantId=default). Same token verified on
      api2 (GET /templates returned 200 + empty list). The HMAC secret is
      shared via EDITOR_TOKEN_SECRET, so the token is replica-portable.
  [4] Killed pulpengine-api1-1; sent 5 GET /health requests through the
      LB; all 5 returned 200 (served by api2). Restarted api1; stack
      returned to 2-replica steady state.

Compose-file fixes required to bring up the stack (committed alongside
this run; otherwise the stack crashloops):
  - ASSET_ACCESS_MODE: private
      (without it, S3_PUBLIC_URL is required when S3_ENDPOINT is set)
  - HARDEN_PRODUCTION: "false"
      (the demo stack does not satisfy production hardening gates by
      design — CORS, METRICS_TOKEN, REQUIRE_HTTPS, BLOCK_REMOTE_RESOURCES,
      named-user registry. Real production keeps these on.)
  - S3_PATH_STYLE: "true" instead of S3_FORCE_PATH_STYLE
      (env var renamed during storage hardening; compose was never
      updated → MinIO bucket lookup blew up with ENOTFOUND on the
      virtual-host-style hostname.)

Source-code fix required for asset streaming (committed alongside this run):
  - apps/api/src/storage/asset-binary/s3-asset-binary.store.ts:
      stream() called response.Body.transformToNodeStream() which was
      removed in @aws-sdk/client-s3 3.700+. Replaced with a runtime
      check (Readable | web stream → Readable.fromWeb).

Deferred (recorded, not yet re-run; both now have reproducible
drivers ready for Run 2 — see "How to drive the checks" below):
  [2] Schedule-fires-once — reproducible via
      `scripts/ha/check-2-schedule-fires-once.mjs`
      (`pnpm ha:check-2`). API-only; no DB client required; cleans
      up its own template + schedule.
  [6] API key rotation — reproducible via the complete playbook in
      the "How to drive the checks > Check 6" section below +
      [`docker-compose.ha.override.yml.example`](https://github.com/TroyCoderBoy/pulpengine/blob/main/docker-compose.ha.override.yml.example).

N/A:
  [5] Tenant archive propagation only applies when MULTI_TENANT_ENABLED=true.
      The HA compose stack runs single-tenant by design (the demo path);
      the multi-tenant variant has its own validation surface tracked
      separately.

Notes / anomalies:
  - GHCR publish blocked: the next tagged release should restore the
    publish step in release.yml (billing block per
    .claude/.../project_actions_billing_block.md). Until then, this
    procedure documents how to substitute a local build:
      docker build -t ghcr.io/troycoderboy/pulp-engine:latest .
      docker compose -f docker-compose.ha.yml up -d
  - Port-3000 conflict: another container (open-webui) had the host
    port. Used a one-line compose override:
      services:
        lb:
          ports: !override
            - "3010:80"
    Follow-up: ship this override as docker-compose.ha.override.yml, or
    document it.

How to drive the checks

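The repo does not yet ship a full automated HA harness. Check 2 has a reproducible script; every other check is executed manually against the running stack. Some commands below use the default LB port 3000 and others use 3010 (the Run 1 override); substitute the port your LB is published on. Minimal commands: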

Check 1 — shared asset readability

# Upload via api1 (force the hit by talking to the LB with a sticky cookie, or
# directly to the container port if exposed)
curl -X POST http://localhost:3000/assets \
  -H "x-api-key: $API_KEY_ADMIN" -F file=@logo.png
# Note the asset id. Render a template that references it, targeting api2 the
# same way. Expect a PDF that embeds the uploaded bytes.
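
Run 1 verified this check by fetching the uploaded file directly from each replica and comparing byte counts. The sketch below shows one way to make that comparison, assuming both downloads are already saved locally. The function name `assets_match` and the file names are hypothetical. It uses `cmp`, which is stricter than comparing sizes alone.

```shell
# Compare two downloaded copies of the same asset, one per replica.
# Prints the byte count of each copy and returns 0 only if the two
# files are byte-identical.
assets_match() (
  a=$1 b=$2
  size_a=$(wc -c < "$a" | tr -d ' ')
  size_b=$(wc -c < "$b" | tr -d ' ')
  echo "api1 copy: ${size_a} bytes, api2 copy: ${size_b} bytes"
  cmp -s "$a" "$b"
)
```

After saving the asset from each replica (how you reach a replica directly depends on which container ports are exposed), `assets_match from-api1.png from-api2.png && echo PASS` gives a one-line result for the run log.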

Check 2 — schedule fires exactly once. Run the reproducible harness at scripts/ha/check-2-schedule-fires-once.mjs — API-only, no DB access required. It seeds a temp template + per-minute schedule, observes ~200s (3 ticks), asserts each fireTime appears exactly once, and tears its own fixtures down.

bash / WSL / macOS / Linux:

LB_BASE_URL=http://localhost:3010 API_KEY_ADMIN=<key> pnpm ha:check-2
# Expect: PASS — 3 ticks, 1 fire each.

PowerShell (Windows):

$env:LB_BASE_URL = 'http://localhost:3010'
$env:API_KEY_ADMIN = '<key>'
pnpm ha:check-2

If you prefer direct SQL for spot-checking a live stack (not required for Run 2):

# Reads the underlying fire_time column the harness inspects via the API.
psql "$DATABASE_URL" -c "SELECT schedule_id, fire_time, count(*) FROM schedule_executions
  WHERE fire_time > now() - interval '5 minutes' GROUP BY schedule_id, fire_time;"
# Expect count = 1 on every row: each (schedule_id, fire_time) fired once, regardless of replica count.
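
Because the query groups by (schedule_id, fire_time), the signal is the count column, not the number of rows. The sketch below turns that into a pass/fail exit code. It assumes the query is run with psql's unaligned, tuples-only output (`-At`), which separates fields with `|`. The function name `assert_single_fire` is hypothetical.

```shell
# Read rows of "schedule_id|fire_time|count" on stdin (psql -At output).
# Prints every row whose count is not 1.
# Exit status: 0 = every fire_time fired once, 1 = duplicates found,
# 2 = no rows at all (nothing fired inside the query window).
assert_single_fire() {
  awk -F'|' '
    NF >= 3 { rows++; if ($3 != 1) { print "DUPLICATE: " $0; bad = 1 } }
    END {
      if (rows == 0) { print "NO ROWS: nothing fired in the window"; exit 2 }
      exit bad + 0
    }
  '
}
```

Piping the spot-check query through it, for example `psql "$DATABASE_URL" -At -c "<query above>" | assert_single_fire`, prints nothing and exits 0 when each tick fired exactly once.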

Check 3 — editor token cross-replica

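# Mint via api1 (same POST /auth/editor-token call shape as Check 6; the
# actor value is illustrative)
TOKEN=$(curl -s -X POST http://localhost:3000/auth/editor-token \
  -H "x-api-key: $API_KEY_EDITOR" -H "content-type: application/json" \
  -d '{"actor":"ha-check-3"}' | jq -r .token)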
# Hit a mutation; LB may route to api2. Expect 200 + audit row credited to the minter.
curl -X PUT http://localhost:3000/templates/demo \
  -H "authorization: Bearer $TOKEN" -H "content-type: application/json" \
  -d '{"definition":{...}}'

Check 4 — graceful degradation

docker compose -f docker-compose.ha.yml kill api1
# Retry traffic — expect 200s from api2.
docker compose -f docker-compose.ha.yml start api1
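
Run 1 exercised the "retry traffic" step by sending five GET /health requests through the LB while api1 was down. The sketch below tallies the resulting status codes. The function name `all_200` is hypothetical, and it only inspects codes fed to it on stdin, so the probing loop stays under the operator's control.

```shell
# Read one HTTP status code per line on stdin, print a tally, and
# return non-zero if any request did not return 200 (or if there were none).
all_200() {
  awk '
    { total++; if ($1 == 200) ok++ }
    END {
      printf "%d/%d returned 200\n", ok, total
      exit (total == 0 || ok != total)
    }
  '
}
```

With api1 killed, something like `for i in 1 2 3 4 5; do curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000/health; done | all_200` reproduces the five-request probe; adjust the port to match your LB.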

Check 5 — tenant archive propagation (multi-tenant mode only)

# Archive a tenant via api1. Immediately attempt a write via api2; it may still
# succeed for up to TENANT_STATUS_CACHE_TTL_MS. Retry after TTL; expect rejection.
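
Check 5 amounts to retrying a write until it is rejected and giving up once the TTL has clearly elapsed. The sketch below shows that polling loop. `poll_until` and the `write_is_rejected` probe named in the comment are hypothetical; the probe is left to the operator because this report does not spell out the tenant-archive or write endpoints.

```shell
# Run a probe command repeatedly until it succeeds or the deadline passes.
# usage: poll_until <timeout_s> <interval_s> <command...>
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
poll_until() (
  timeout_s=$1 interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    "$@" && exit 0
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep "$interval_s"
  done
)

# Example (assumption: write_is_rejected is a function you define that
# returns 0 once api2 rejects the write). The TTL is in milliseconds, so
# round it up to whole seconds and add a small margin:
#   ttl_s=$(( (TENANT_STATUS_CACHE_TTL_MS + 999) / 1000 ))
#   poll_until $(( ttl_s + 5 )) 1 write_is_rejected && echo PASS
```

Timing out here means api2 was still accepting writes after the cache TTL, which is the failure this check is designed to catch.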

Check 6 — API key rotation (complete playbook).

The rotation contract (docs/api-guide.md:249, docs/deployment-guide.md:83, memory project_auth_rotation_v1.md): API_KEY_*_PREVIOUS is verify-only for existing editor session tokens. It cannot mint new tokens and cannot be used as X-Api-Key. Direct X-Api-Key callers using the OLD key get 401 post-cutover and must switch.

Baseline capture. OLD_ADMIN=<current API_KEY_ADMIN>; generate a new random key NEW_ADMIN.

Pre-cutover: mint an editor token with the OLD admin key. This is the token we’ll prove still verifies via *_PREVIOUS during the rollover window.

OLD_TOKEN=$(curl -s -X POST http://localhost:3010/auth/editor-token \
  -H "x-api-key: $OLD_ADMIN" -H "content-type: application/json" \
  -d '{"actor":"rotation-test"}' | jq -r .token)

PowerShell:

$resp = Invoke-RestMethod -Method Post -Uri http://localhost:3010/auth/editor-token `
  -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
  -Body '{"actor":"rotation-test"}'
$env:OLD_TOKEN = $resp.token

Pre-cutover sanity: OLD key via X-Api-Key returns 200.

curl -o /dev/null -s -w "%{http_code}\n" \
  -H "x-api-key: $OLD_ADMIN" http://localhost:3010/templates
# 200

Record cutover timestamp for the restart-window log check:
```
T0=$(date +%s)
```
Cutover. Copy docker-compose.ha.override.yml.example to docker-compose.ha.override.yml and uncomment the “API key rotation cutover” section. Set shell env vars NEW_ADMIN and OLD_ADMIN, then:
```
docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
  up -d --no-deps api1 api2
```
Wait for both /health/ready to return 200.
Log evidence, scoped to the restart window and captured BEFORE the deliberate negative probes. The zero count applies only to background traffic during the restart (schedulers, internal probes, in-flight client calls). The deliberate 401 probes (step 7, "Post-cutover deliberate probes") come AFTER this check so they are not folded into the count:
```
SINCE=$(( $(date +%s) - T0 ))
docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
  logs --since="${SINCE}s" api1 api2 \
  | grep -Ec '"msg":"Auth failure"|"statusCode":401'
# Expected: 0. Any non-zero value is a real regression — capture
# the offending lines for the Run 2 report.
```
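
One practical wrinkle when scripting this step: `grep -c` prints `0` but exits with status 1 when nothing matches, so the expected result can abort a script running under `set -e`. The sketch below is a small wrapper that uses the same pattern and always exits 0. The function name `count_auth_failures` is hypothetical.

```shell
# Count log lines that look like auth failures; reads log text on stdin.
# grep -c exits 1 when the count is 0, so the status is swallowed here and
# the caller should judge the result by the printed number instead.
count_auth_failures() {
  grep -Ec '"msg":"Auth failure"|"statusCode":401' || true
}
```

Piping the restart-window logs into `count_auth_failures` yields a number that can be pasted straight into the run report. Note that it counts matching lines, not individual matches.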

Post-cutover deliberate probes. Four explicit calls that prove the rotation contract:

| Call | Expected | Why |
| --- | --- | --- |
| `curl -H "x-api-key: $NEW_ADMIN" /templates` | 200 | New key is active. |
| `curl -H "x-editor-token: $OLD_TOKEN" /templates` | 200 | Old token verifies via `*_PREVIOUS` within rollover. |
| `curl -H "x-api-key: $OLD_ADMIN" /templates` | 401 | Previous keys cannot be used as `X-Api-Key`. |
| `curl -X POST -H "x-api-key: $OLD_ADMIN" /auth/editor-token` | 401 | Previous keys cannot mint. |
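
The four probes reduce to comparing an observed status-code tuple against 200/200/401/401. The sketch below does that comparison. The function name `check_probe_matrix` is hypothetical, and the codes are passed in as arguments so the probing commands themselves stay as written in the table.

```shell
# Compare observed status codes (in probe order) with the expected matrix.
# usage: check_probe_matrix <new-key> <old-token> <old-key> <old-key-mint>
check_probe_matrix() (
  expected="200 200 401 401"
  observed="$1 $2 $3 $4"
  if [ "$observed" = "$expected" ]; then
    echo "PASS: $observed"
  else
    echo "FAIL: expected $expected, got $observed"
    exit 1
  fi
)
```

Each code can be captured the same way as in the pre-cutover sanity step (`curl -o /dev/null -s -w "%{http_code}"`), then passed in order, for example `check_probe_matrix "$c1" "$c2" "$c3" "$c4"`.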

PowerShell probes (same four, same expected codes):

$r1 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
        -Headers @{ 'x-api-key' = $env:NEW_ADMIN } -SkipHttpErrorCheck
$r2 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
        -Headers @{ 'x-editor-token' = $env:OLD_TOKEN } -SkipHttpErrorCheck
$r3 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
        -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -SkipHttpErrorCheck
$r4 = Invoke-WebRequest -Method Post -Uri http://localhost:3010/auth/editor-token `
        -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
        -Body '{"actor":"rotation-test"}' -SkipHttpErrorCheck
$r1.StatusCode, $r2.StatusCode, $r3.StatusCode, $r4.StatusCode
# 200 200 401 401

Post-rollover window cleanup. Once EDITOR_TOKEN_TTL_MINUTES (default 480 min / 8h) has elapsed since cutover, remove API_KEY_ADMIN_PREVIOUS from the override, restart replicas, and confirm:
- curl -H "x-editor-token: $OLD_TOKEN" /templates → 401 (old token no longer verifies — previous key is gone).
- Minting a fresh editor token with NEW_ADMIN still works.
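
The cleanup is gated on the rollover window, which runs for EDITOR_TOKEN_TTL_MINUTES from the recorded cutover timestamp. The sketch below works out the earliest cleanup time, assuming `T0` was captured with `date +%s` as in the cutover step. The function name `rollover_end` is hypothetical.

```shell
# Print the epoch second at which the rollover window ends.
# usage: rollover_end <cutover_epoch_s> [ttl_minutes]
# ttl_minutes defaults to 480, the documented EDITOR_TOKEN_TTL_MINUTES default.
rollover_end() {
  echo $(( $1 + ${2:-480} * 60 ))
}
```

With the default TTL the window is 480 × 60 = 28,800 seconds, so a cutover at epoch 1000 gives 29800. In practice, `[ "$(date +%s)" -ge "$(rollover_end "$T0")" ]` tells you whether it is safe to remove API_KEY_ADMIN_PREVIOUS.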

Follow-up

A full HA harness (compose up → run all 6 checks → tear down, with pass/fail assertions) remains a separate initiative. The Check-2 script at scripts/ha/check-2-schedule-fires-once.mjs is the first building block — API-only, self-cleaning, reusable under any deployment that exposes the admin API. Check 6 stays manual because the rotation contract requires an env-swap + restart cycle that doesn’t script cleanly without touching the compose file. Checks 1/3/4 passed in Run 1 and are only rerun if the underlying code paths change.

Run 2 template

Fill in and append to the Runs section when executing the two deferred checks against a live stack.

#### Run 2 — YYYY-MM-DD (targeted rerun for the 2 deferred checks)

Date:             YYYY-MM-DD
Operator:         <name> (with Claude Opus 4.7)
Pulp Engine version: v0.74.0 (APP_VERSION) — local build (GHCR publish
                     pending — see docs/runbooks/ghcr-republish.md)
Stack deltas from Run 1: none; compose-env fixes from Run 1 are live.

Checks (only the two previously deferred are in scope):
  [✓/✗] 2. Schedule fires exactly once across replicas — driven by
           pnpm ha:check-2 (scripts/ha/check-2-schedule-fires-once.mjs).
           Attach the PASS line + the raw executions rows the harness
           printed.
  [✓/✗] 6. API key rotation with *_PREVIOUS — driven by the playbook
           in "How to drive the checks > Check 6" above +
           docker-compose.ha.override.yml.example.
           Attach:
             - The restart-window log count (step 6, "Log evidence";
               expected 0).
             - The 4-row 200/200/401/401 probe matrix (step 7,
               "Post-cutover deliberate probes").

Notes / anomalies:
  <free text>
