Pulp Engine Document Rendering

HA Validation Report

Type: Mixed — Checks 2, 6, 7, 8 are nightly-automated; Checks 1, 3, 4, 5 remain manual. Check 9 (multi-instance S3 read-after-write) is explicitly deferred to a follow-up.

Stack: docker-compose.ha.yml — 2 API replicas + Postgres + MinIO + nginx LB. Demo/evaluation stack; not production-representative.

Checklist: ha-reference-architecture.md § 7.


Coverage at a glance

| # | Check | Status | Driver |
|---|-------|--------|--------|
| 1 | Shared asset readability (upload via api1, render via api2) | Manual | Curl steps below |
| 2 | Schedule fires exactly once across replicas | Automated nightly | scripts/ha/check-2-schedule-fires-once.mjs |
| 3 | Editor token minted on api1 verifies on api2 | Manual | Curl steps below |
| 4 | Graceful degradation when one replica is killed | Manual | Docker compose steps below |
| 5 | Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS | Manual (multi-tenant only) | Curl steps below |
| 6 | API key rotation with *_PREVIOUS, zero 401s | Automated nightly (editor-key variant); manual playbook for admin-key | scripts/ha/check-6-api-key-rotation.mjs + the playbook below |
| 7 | Single-replica outage failover ≥99% availability | Automated nightly | scripts/ha/check-7-outage-failover.mjs |
| 8 | Redis-backed rate-limit shared bucket across replicas | Automated nightly | scripts/ha/check-8-rate-limit-shared.mjs |
| 9 | Multi-instance S3/asset read-after-write | Deferred | See Deferred coverage below |

The four automated checks run nightly via .github/workflows/ha-nightly.yml (cron 14:00 UTC; workflow_dispatch also accepted). Each job builds the image, boots the HA stack via docker-compose.ha.yml + docker-compose.ha.ci.yml (and docker-compose.ha.redis.ci.yml for Check 8), runs the driver, dumps compose logs as an artifact on failure, and tears the stack down.
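
To reproduce one nightly job locally, the lifecycle described above can be approximated in shell. This is a sketch under stated assumptions, not the workflow's actual steps: the function names `ha_compose` and `run_ha_check` are hypothetical, the driver is assumed to run as a plain command with no extra flags or environment, and teardown uses a plain `down`. Check 8 would additionally need `-f docker-compose.ha.redis.ci.yml`.

```shell
# Run every compose call against the same pair of files (assumed sufficient
# for Checks 2, 6 and 7).
ha_compose() {
  docker compose -f docker-compose.ha.yml -f docker-compose.ha.ci.yml "$@"
}

# run_ha_check LOG_FILE DRIVER_COMMAND...
# Builds and boots the stack, runs the driver, dumps compose logs to LOG_FILE
# only on failure, and always tears the stack down.
run_ha_check() {
  ha_log_file=$1; shift
  ha_status=0

  ha_compose up -d --build || ha_status=$?
  if [ "$ha_status" -eq 0 ]; then
    "$@" || ha_status=$?            # the driver command supplied by the caller
  fi
  if [ "$ha_status" -ne 0 ]; then
    ha_compose logs --no-color > "$ha_log_file" 2>&1
  fi
  ha_compose down                   # teardown runs on success and on failure
  return "$ha_status"
}

# Example (assumes the driver can be started directly with node):
#   run_ha_check ./ha-compose.log node scripts/ha/check-7-outage-failover.mjs
```

The function returns the driver's exit status, so it can gate a wrapper script the same way pass/fail gates the workflow.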

Latest nightly run: see the HA nightly workflow run history. Pass/fail is the gating signal — this document does not maintain a per-run log because the workflow’s run history already does.

Check 7 history note (2026-06-11, v0.85.0): Check 7 failed every nightly from its 2026-05-06 landing until 2026-06-09 (~95.2 % availability vs the 99 % threshold). Root cause was load-balancer behaviour, not the product: docker compose stop blackholes new connections, nginx’s default 60 s proxy_connect_timeout let probes hang until client abort, and aborts never count toward max_fails — so the dead upstream was never benched. Fixed in v0.85.0 (docker-compose.ha.nginx.conf: 1 s connect timeout, bounded proxy_next_upstream retry, explicit benching). First green runs: 282/282 (100.00 %) on both a dispatch (run 27278656273) and a real scheduled nightly (run 27286219985), 2026-06-10. Any future red opens a pinned ha-nightly-failure issue automatically.
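
The pass criterion behind these numbers is plain arithmetic: successful probes divided by total probes, compared with the 99 % threshold. A minimal sketch follows; the helper names are hypothetical and the driver's own computation may differ, for example in rounding.

```shell
# availability_pct OK TOTAL -> prints the percentage with two decimals.
availability_pct() {
  awk -v ok="$1" -v total="$2" \
    'BEGIN { if (total > 0) printf "%.2f\n", 100 * ok / total; else print "0.00" }'
}

# meets_threshold OK TOTAL [MIN_PCT] -> exit 0 when availability >= MIN_PCT
# (MIN_PCT defaults to 99, the Check 7 threshold).
meets_threshold() {
  awk -v ok="$1" -v total="$2" -v min="${3:-99}" \
    'BEGIN { exit !(total > 0 && 100 * ok / total >= min) }'
}

# Example: the first green runs reported 282 of 282 probes succeeding.
#   availability_pct 282 282    # prints 100.00
#   meets_threshold 282 282 && echo pass
```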


Deferred coverage

Check 9 (multi-instance S3/asset read-after-write) is deferred to a follow-up batch. Landing it will close the audit gap flagged at .github/workflows/ha-nightly.yml:40-41 (“Still deferred to a separate follow-up batch: Multi-instance asset/S3 read-after-write check (Workstream D Phase 2)”). The codebase has no existing fixture for multi-instance read-after-write against S3AssetBinaryStore; building one is a mini-project (MinIO/S3 fixture in docker-compose.ha.ci.yml, new scripts/ha/check-9-*.mjs, ha-nightly wiring).


Driving the manual checks

The four checks not yet automated are run by hand against a local stack. Bring up the stack first:

docker compose -f docker-compose.ha.yml up -d
# Wait for both replicas to be ready (api1 + api2 listen on 3000 in-container; the LB exposes 3000 on the host).
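
The wait can be scripted rather than eyeballed. The sketch below polls a readiness URL until it answers 200 several times in a row. Assumptions: `/health/ready` is reachable through the LB on localhost:3000, and a run of consecutive successes is a reasonable proxy for both replicas being ready because the LB spreads requests across api1 and api2. The function name `wait_ready` and its defaults are hypothetical.

```shell
# wait_ready [URL] [CONSECUTIVE_OK] [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 once URL has answered 200 CONSECUTIVE_OK times in a row,
# or 1 after MAX_TRIES attempts.
wait_ready() {
  wr_url=${1:-http://localhost:3000/health/ready}
  wr_needed=${2:-6}
  wr_tries=${3:-60}
  wr_delay=${4:-2}
  wr_streak=0
  while [ "$wr_tries" -gt 0 ]; do
    # curl prints 000 when the connection fails; "|| true" keeps set -e happy.
    wr_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$wr_url") || true
    if [ "$wr_code" = "200" ]; then
      wr_streak=$((wr_streak + 1))
      if [ "$wr_streak" -ge "$wr_needed" ]; then
        return 0
      fi
    else
      wr_streak=0                   # any failure resets the streak
    fi
    wr_tries=$((wr_tries - 1))
    sleep "$wr_delay"
  done
  return 1
}

# Example:
#   wait_ready && echo "stack is ready"
```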

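Run 1’s compose-env fixes (asset access mode, hardening opt-out for the demo, S3 path style) are already on main, so no per-run patching is required. If port 3000 is already in use on your host, add a short Compose override that publishes the LB on another port, for example 3010 (the port the Check 6 playbook below uses):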

services:
  lb:
    ports: !override
      - "3010:80"

Check 1 — shared asset readability

# Upload via api1 (force the hit by talking to the LB with a sticky cookie, or
# directly to the container port if exposed).
curl -X POST http://localhost:3000/assets \
  -H "x-api-key: $API_KEY_ADMIN" -F file=@logo.png
# Note the asset id. Render a template that references it, targeting api2 the
# same way. Expect a PDF that embeds the uploaded bytes.

Check 3 — editor token cross-replica

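# Mint via api1 (same mint route as the Check 6 playbook; the actor value is arbitrary).
TOKEN=$(curl -s -X POST http://localhost:3000/auth/editor-token \
  -H "x-api-key: $API_KEY_EDITOR" -H "content-type: application/json" \
  -d '{"actor":"ha-check-3"}' | jq -r .token)
# Hit a mutation; the LB may route to api2. Expect 200 + audit row credited to the minter.
# Replace {...} with a real template definition before running.
curl -X PUT http://localhost:3000/templates/demo \
  -H "authorization: Bearer $TOKEN" -H "content-type: application/json" \
  -d '{"definition":{...}}'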

Check 4 — graceful degradation

docker compose -f docker-compose.ha.yml kill api1
# Retry traffic — expect 200s from api2.
docker compose -f docker-compose.ha.yml start api1
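
"Retry traffic" can be made measurable by counting how many of N requests succeed while api1 is down. A minimal sketch, assuming the LB is on localhost:3000 and that GET /templates with the admin key returns 200 on a healthy stack (as in the rotation playbook's sanity step). The helper name `count_ok` is hypothetical.

```shell
# count_ok URL N -> prints how many of N requests returned HTTP 200.
count_ok() {
  co_url=$1
  co_n=$2
  co_ok=0
  co_i=0
  while [ "$co_i" -lt "$co_n" ]; do
    # A failed connection yields 000, which is counted as a miss.
    co_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
      -H "x-api-key: $API_KEY_ADMIN" "$co_url") || true
    if [ "$co_code" = "200" ]; then
      co_ok=$((co_ok + 1))
    fi
    co_i=$((co_i + 1))
  done
  echo "$co_ok"
}

# Example, run between the kill and the start:
#   count_ok http://localhost:3000/templates 50
```

A handful of misses immediately after the kill may be expected while the LB stops routing to the dead replica; the manual check only asks that traffic keeps being served by api2.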

Check 5 — tenant archive propagation (multi-tenant mode only)

# Archive a tenant via api1. Immediately attempt a write via api2; it may still
# succeed for up to TENANT_STATUS_CACHE_TTL_MS. Retry after TTL; expect rejection.
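
The "retry after TTL" step can be expressed as a bounded poll. The sketch below repeats a caller-supplied write once per second until it stops returning a 2xx status or until the TTL plus a small slack has elapsed. The archive and write requests themselves are not shown because this document does not specify them; the helper name `wait_until_rejected` and the 5-second slack are assumptions.

```shell
# wait_until_rejected TTL_MS COMMAND...
# COMMAND must print an HTTP status code (for example a curl call using
# -s -o /dev/null -w '%{http_code}'). Prints the first non-2xx status and
# returns 0, or returns 1 if the write is still accepted after TTL + slack.
wait_until_rejected() {
  wu_ttl_ms=$1; shift
  wu_deadline=$(( wu_ttl_ms / 1000 + 5 ))   # TTL in seconds plus 5 s slack
  wu_elapsed=0
  while [ "$wu_elapsed" -le "$wu_deadline" ]; do
    wu_code=$("$@") || true
    case "$wu_code" in
      2??) ;;                               # still accepted: cache not expired yet
      *) echo "$wu_code"; return 0 ;;       # note: 000 (connection error) lands here too
    esac
    sleep 1
    wu_elapsed=$(( wu_elapsed + 1 ))
  done
  return 1
}

# Example (write_via_api2 is a placeholder for your own request function):
#   wait_until_rejected "$TENANT_STATUS_CACHE_TTL_MS" write_via_api2
```

Check the printed status before treating the run as a pass, since a connection failure also ends the poll.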

Check 6 — API key rotation (admin-key playbook)

The admin-key rotation contract (docs/api-guide.md:249, docs/deployment-guide.md:83, memory project_auth_rotation_v1.md): API_KEY_*_PREVIOUS is verify-only for existing editor session tokens. It cannot mint new tokens and cannot be used as X-Api-Key. Direct X-Api-Key callers using the OLD key get 401 post-cutover and must switch.

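The editor-key variant of this contract is exercised by scripts/ha/check-6-api-key-rotation.mjs on every nightly run; the manual playbook below covers admin-key rotation. Its commands target http://localhost:3010, i.e. the LB remapped with the port override shown earlier; substitute 3000 if you kept the default port.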

  1. Baseline capture. OLD_ADMIN=<current API_KEY_ADMIN>; generate a new random key NEW_ADMIN.

  2. Pre-cutover: mint an editor token with the OLD admin key. This is the token we’ll prove still verifies via *_PREVIOUS during the rollover window.

    OLD_TOKEN=$(curl -s -X POST http://localhost:3010/auth/editor-token \
      -H "x-api-key: $OLD_ADMIN" -H "content-type: application/json" \
      -d '{"actor":"rotation-test"}' | jq -r .token)

    PowerShell:

    $resp = Invoke-RestMethod -Method Post -Uri http://localhost:3010/auth/editor-token `
      -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
      -Body '{"actor":"rotation-test"}'
    $env:OLD_TOKEN = $resp.token

  3. Pre-cutover sanity: OLD key via X-Api-Key returns 200.

    curl -o /dev/null -s -w "%{http_code}\n" \
      -H "x-api-key: $OLD_ADMIN" http://localhost:3010/templates
    # 200

  4. Record cutover timestamp for the restart-window log check:

    T0=$(date +%s)

  5. Cutover. Copy docker-compose.ha.override.yml.example to docker-compose.ha.override.yml and uncomment the “API key rotation cutover” section. Set shell env vars NEW_ADMIN and OLD_ADMIN, then:

    docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
      up -d --no-deps api1 api2

    Wait for both /health/ready to return 200.

  6. Log evidence — scoped to the restart window, BEFORE running the deliberate negative probes. This zero-count applies only to background traffic during the restart (schedulers, internal probes, in-flight client calls). The deliberate 401 probes in step 7 come AFTER this check so they don’t get folded in:

    SINCE=$(( $(date +%s) - T0 ))
    # Pass the same -f files as the cutover so compose resolves the same project.
    docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
      logs --since="${SINCE}s" api1 api2 \
      | grep -Ec '"msg":"Auth failure"|"statusCode":401'
    # Expected: 0. Any non-zero value is a real regression; capture
    # the offending lines.

  7. Post-cutover deliberate probes. Four explicit calls that prove the rotation contract:

    | Call | Expected | Why |
    |------|----------|-----|
    | curl -H "x-api-key: $NEW_ADMIN" /templates | 200 | New key is active. |
    | curl -H "x-editor-token: $OLD_TOKEN" /templates | 200 | Old token verifies via *_PREVIOUS within rollover. |
    | curl -H "x-api-key: $OLD_ADMIN" /templates | 401 | Previous keys cannot be used as X-Api-Key. |
    | curl -X POST -H "x-api-key: $OLD_ADMIN" /auth/editor-token | 401 | Previous keys cannot mint. |

    PowerShell probes (same four, same expected codes):

    $r1 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
            -Headers @{ 'x-api-key' = $env:NEW_ADMIN } -SkipHttpErrorCheck
    $r2 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
            -Headers @{ 'x-editor-token' = $env:OLD_TOKEN } -SkipHttpErrorCheck
    $r3 = Invoke-WebRequest -Method Get  -Uri http://localhost:3010/templates `
            -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -SkipHttpErrorCheck
    $r4 = Invoke-WebRequest -Method Post -Uri http://localhost:3010/auth/editor-token `
            -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
            -Body '{"actor":"rotation-test"}' -SkipHttpErrorCheck
    $r1.StatusCode, $r2.StatusCode, $r3.StatusCode, $r4.StatusCode
    # 200 200 401 401

  8. Post-rollover window cleanup. Once EDITOR_TOKEN_TTL_MINUTES (default 480 min / 8h) has elapsed since cutover, remove API_KEY_ADMIN_PREVIOUS from the override, restart replicas, and confirm:

    • curl -H "x-editor-token: $OLD_TOKEN" /templates returns 401 (the old token no longer verifies because the previous key is gone).
    • Minting a fresh editor token with NEW_ADMIN still works.
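
For shells without PowerShell, the four probes from step 7 can be run as one self-checking function. This is a sketch that mirrors the table and the PowerShell block: `BASE_URL`, `expect_status`, and `run_rotation_probes` are hypothetical names, and NEW_ADMIN, OLD_ADMIN, and OLD_TOKEN are assumed to be set as in the earlier steps.

```shell
BASE_URL=${BASE_URL:-http://localhost:3010}

# expect_status EXPECTED LABEL CURL_ARGS...
# Prints PASS/FAIL and returns non-zero when the status differs.
expect_status() {
  es_expected=$1
  es_label=$2
  shift 2
  es_got=$(curl -s -o /dev/null -w '%{http_code}' "$@") || true
  if [ "$es_got" = "$es_expected" ]; then
    echo "PASS $es_label ($es_got)"
  else
    echo "FAIL $es_label (expected $es_expected, got $es_got)"
    return 1
  fi
}

# Runs all four probes even if one fails; returns non-zero if any failed.
run_rotation_probes() {
  rp_fail=0
  expect_status 200 "new key is active" \
    -H "x-api-key: $NEW_ADMIN" "$BASE_URL/templates" || rp_fail=1
  expect_status 200 "old token verifies via *_PREVIOUS" \
    -H "x-editor-token: $OLD_TOKEN" "$BASE_URL/templates" || rp_fail=1
  expect_status 401 "old key rejected as X-Api-Key" \
    -H "x-api-key: $OLD_ADMIN" "$BASE_URL/templates" || rp_fail=1
  expect_status 401 "old key cannot mint" \
    -X POST -H "x-api-key: $OLD_ADMIN" -H "content-type: application/json" \
    -d '{"actor":"rotation-test"}' "$BASE_URL/auth/editor-token" || rp_fail=1
  return "$rp_fail"
}
```

Run it only after the step 6 log check, for the same reason the deliberate 401 probes come after that check in the playbook.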

Notes for operators

  • The HA stack runs single-tenant by design; the multi-tenant variant (MULTI_TENANT_ENABLED=true) has its own validation surface tracked separately. Check 5 only applies under multi-tenant mode.
  • The four nightly-automated jobs each take 25–40 minutes including image build and compose cold-start. They run on a daily cron and on demand via workflow_dispatch; they are intentionally not wired to PR / push events.
  • If a nightly run fails, the compose logs are uploaded as a workflow artifact (ha-logs-<run_id> / ha-check-<n>-logs-<run_id>, retention 7 days). Check those before re-running.
  • The compose-env fixes from the original 2026-04-21 manual run (asset access mode, hardening opt-out for the demo, S3 path style) are committed and live on main; no per-run patching is required.
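
Fetching the log artifacts for a failed run can be done from a terminal. The sketch below assumes the GitHub CLI (`gh`) is installed and authenticated inside a checkout of the repository; the function name `fetch_ha_logs` is hypothetical. It uses a glob so that both artifact naming forms mentioned above match.

```shell
# fetch_ha_logs RUN_ID [DEST_DIR]
# Downloads every artifact whose name matches ha-*logs-RUN_ID, which covers
# both ha-logs-<run_id> and ha-check-<n>-logs-<run_id>.
fetch_ha_logs() {
  fl_run_id=$1
  fl_dest=${2:-ha-logs-$fl_run_id}
  gh run download "$fl_run_id" --pattern "ha-*logs-$fl_run_id" --dir "$fl_dest"
}

# Typical sequence:
#   gh run list --workflow ha-nightly.yml --status failure --limit 5
#   fetch_ha_logs <run_id>
#   gh workflow run ha-nightly.yml     # re-run on demand via workflow_dispatch
```

Artifacts are retained for 7 days, so download them before that window closes if a failure needs later investigation.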