HA Validation Report
Type: Mixed — Checks 2, 6, 7, 8 are nightly-automated; Checks 1, 3, 4, 5 remain manual. Check 9 (multi-instance S3 read-after-write) is explicitly deferred to a follow-up.
Stack: docker-compose.ha.yml — 2 API replicas + Postgres + MinIO + nginx LB. Demo/evaluation stack; not production-representative.
Checklist: ha-reference-architecture.md § 7.
Coverage at a glance
| # | Check | Status | Driver |
|---|---|---|---|
| 1 | Shared asset readability (upload via api1, render via api2) | Manual | Curl steps below |
| 2 | Schedule fires exactly once across replicas | Automated nightly | scripts/ha/check-2-schedule-fires-once.mjs |
| 3 | Editor token minted on api1 verifies on api2 | Manual | Curl steps below |
| 4 | Graceful degradation when one replica is killed | Manual | Docker compose steps below |
| 5 | Tenant archive propagation within TENANT_STATUS_CACHE_TTL_MS | Manual (multi-tenant only) | Curl steps below |
| 6 | API key rotation with *_PREVIOUS, zero 401s | Automated nightly (editor-key variant); manual playbook for admin-key | scripts/ha/check-6-api-key-rotation.mjs + the playbook below |
| 7 | Single-replica outage failover ≥99% availability | Automated nightly | scripts/ha/check-7-outage-failover.mjs |
| 8 | Redis-backed rate-limit shared bucket across replicas | Automated nightly | scripts/ha/check-8-rate-limit-shared.mjs |
| 9 | Multi-instance S3/asset read-after-write | Deferred — see Deferred coverage below | — |
The four automated checks run nightly via .github/workflows/ha-nightly.yml (cron 14:00 UTC; workflow_dispatch also accepted). Each job builds the image, boots the HA stack via docker-compose.ha.yml + docker-compose.ha.ci.yml (and docker-compose.ha.redis.ci.yml for Check 8), runs the driver, dumps compose logs as an artifact on failure, and tears the stack down.
Latest nightly run: see the HA nightly workflow run history. Pass/fail is the gating signal — this document does not maintain a per-run log because the workflow’s run history already does.
Check 7 history note (2026-06-11, v0.85.0): Check 7 failed every nightly
from its 2026-05-06 landing until 2026-06-09 (~95.2 % availability vs the
99 % threshold). Root cause was load-balancer behaviour, not the product:
docker compose stop blackholes new connections, nginx’s default 60 s
proxy_connect_timeout let probes hang until client abort, and aborts never
count toward max_fails — so the dead upstream was never benched. Fixed in
v0.85.0 (docker-compose.ha.nginx.conf:
1 s connect timeout, bounded proxy_next_upstream retry, explicit
benching). First green runs: 282/282 (100.00 %) on both a dispatch
(run 27278656273) and a real scheduled nightly (run 27286219985),
2026-06-10. Any future red opens a pinned ha-nightly-failure issue
automatically.
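As a worked check of the figures above, here is a minimal sketch of the availability arithmetic. It assumes availability is simply successful probes divided by total probes; the driver's actual formula is not shown in this report, and `availabilityPercent` and `meetsThreshold` are hypothetical names.

```javascript
// Hypothetical helpers; not taken from scripts/ha/check-7-outage-failover.mjs.
// Assumption: availability = successful probes / total probes, as a percentage.
function availabilityPercent(okCount, totalCount) {
  if (totalCount <= 0) throw new RangeError("totalCount must be positive");
  return (okCount / totalCount) * 100;
}

// The 99 % threshold is the one quoted for Check 7 in this report.
function meetsThreshold(okCount, totalCount, thresholdPercent = 99) {
  return availabilityPercent(okCount, totalCount) >= thresholdPercent;
}

console.log(availabilityPercent(282, 282).toFixed(2)); // "100.00"
// With 282 probes, at most 2 may fail: 280/282 is about 99.29 %, 279/282 about 98.94 %.
console.log(meetsThreshold(280, 282), meetsThreshold(279, 282)); // true false
```

Under this assumption a run of 282 probes has very little headroom, which is why a dead upstream that is never benched is enough to push a run under the threshold.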
Deferred coverage
Check 9 — multi-instance S3/asset read-after-write. Deferred to a follow-up batch. Closes the audit gap flagged at .github/workflows/ha-nightly.yml:40-41 (“Still deferred to a separate follow-up batch: Multi-instance asset/S3 read-after-write check (Workstream D Phase 2)”). The codebase has no existing fixture for multi-instance read-after-write against S3AssetBinaryStore; building one is a mini-project (MinIO/S3 fixture in docker-compose.ha.ci.yml, new scripts/ha/check-9-*.mjs, ha-nightly wiring).
Driving the manual checks
The four checks not yet automated are run by hand against a local stack. Bring up the stack first:
docker compose -f docker-compose.ha.yml up -d
# Wait for both replicas to be ready (api1 + api2 listen on 3000 in-container; the LB exposes 3000 on the host).
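The compose-env fixes from the first manual run (asset access mode, hardening opt-out for the demo, S3 path style) are already on main, so no per-run patching is required. If port 3000 is already in use on your host, add a short compose override that remaps the LB port (the Check 6 playbook below uses http://localhost:3010, i.e. this mapping):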
services:
  lb:
    ports: !override
      - "3010:80"
Check 1 — shared asset readability
# Upload via api1 (force the hit by talking to the LB with a sticky cookie, or
# directly to the container port if exposed).
curl -X POST http://localhost:3000/assets \
-H "x-api-key: $API_KEY_ADMIN" -F file=@logo.png
# Note the asset id. Render a template that references it, targeting api2 the
# same way. Expect a PDF that embeds the uploaded bytes.
Check 3 — editor token cross-replica
# Mint via api1.
TOKEN=$(curl -s http://localhost:3000/editor-token -H "x-api-key: $API_KEY_EDITOR" | jq -r .token)
# Hit a mutation; the LB may route to api2. Expect 200 + audit row credited to the minter.
curl -X PUT http://localhost:3000/templates/demo \
-H "authorization: Bearer $TOKEN" -H "content-type: application/json" \
-d '{"definition":{...}}'
Check 4 — graceful degradation
docker compose -f docker-compose.ha.yml kill api1
# Retry traffic — expect 200s from api2.
docker compose -f docker-compose.ha.yml start api1
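The "retry traffic" step can be scripted instead of done by hand. The following is a minimal sketch, assuming Node 18+ (global `fetch` and `AbortSignal.timeout`) and that only a 200 counts as success. The `probe` function, its defaults, and the choice of /health/ready through the LB as the target are illustrative assumptions, not the nightly driver's settings.

```javascript
// Hypothetical probe loop; fetchImpl is injectable so the logic can be exercised offline.
async function probe(url, { count = 20, intervalMs = 250, timeoutMs = 2000, fetchImpl = fetch } = {}) {
  let ok = 0;
  for (let i = 0; i < count; i++) {
    try {
      // Bound each request so a blackholed connection cannot hang the loop.
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (res.status === 200) ok++;
    } catch {
      // Timeouts and connection errors count as failed probes.
    }
    if (i < count - 1) await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return { ok, total: count };
}

// Usage, with the stack up and api1 killed (URL assumes the default LB port):
// probe("http://localhost:3000/health/ready").then(console.log);
```

The per-request timeout matters here for the same reason it mattered in the Check 7 history note: without it, a probe against a stopped container can hang rather than fail.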
Check 5 — tenant archive propagation (multi-tenant mode only)
# Archive a tenant via api1. Immediately attempt a write via api2; it may still
# succeed for up to TENANT_STATUS_CACHE_TTL_MS. Retry after TTL; expect rejection.
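The propagation window can be checked by polling rather than by waiting a fixed time. This is a minimal sketch, assuming a caller-supplied `attemptWrite` function that resolves to the HTTP status of a write routed to api2, and assuming any non-2xx status means the archive has propagated. The function name, slack, and polling interval are hypothetical, and no particular endpoint is implied.

```javascript
// Hypothetical helper: polls until a write is rejected, then reports the elapsed time.
// attemptWrite, now, and sleep are injected so the logic can be tested without a stack.
async function measurePropagation(attemptWrite, ttlMs, {
  slackMs = 1000,
  pollMs = 200,
  now = () => Date.now(),
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  const start = now();
  const deadline = start + ttlMs + slackMs;
  while (now() <= deadline) {
    const status = await attemptWrite();
    // Assumption: any non-2xx response is treated as "write rejected".
    const rejected = status < 200 || status >= 300;
    if (rejected) return { propagated: true, elapsedMs: now() - start };
    await sleep(pollMs);
  }
  return { propagated: false, elapsedMs: now() - start };
}
```

Call it immediately after archiving the tenant via api1, passing the configured TENANT_STATUS_CACHE_TTL_MS as `ttlMs`; a result of `propagated: false` means writes were still accepted after the TTL plus slack.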
Check 6 — API key rotation (admin-key playbook)
The admin-key rotation contract (docs/api-guide.md:249, docs/deployment-guide.md:83,
memory project_auth_rotation_v1.md): API_KEY_*_PREVIOUS is verify-only for
existing editor session tokens. It cannot mint new tokens and cannot be used
as X-Api-Key. Direct X-Api-Key callers using the OLD key get 401 post-cutover
and must switch.
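The contract can be summarized as a small decision table. The sketch below is a toy model only: it represents an editor token as a plain object recording which key minted it, an assumption made purely for illustration that says nothing about the real token format or verification code. `makeAuth`, `mint`, and `check` are hypothetical names.

```javascript
// Toy model of the rotation contract; not the product's auth code.
function makeAuth({ current, previous = null }) {
  return {
    // Minting requires the CURRENT key; a previous key can never mint.
    mint(apiKey) {
      if (apiKey !== current) return { status: 401 };
      return { status: 200, token: { mintedWith: apiKey } };
    },
    // X-Api-Key accepts only the current key. Editor tokens verify against the
    // current key or, while a previous key is configured, the previous one.
    check({ apiKey, editorToken }) {
      if (apiKey !== undefined) return apiKey === current ? 200 : 401;
      if (editorToken !== undefined) {
        const key = editorToken.mintedWith;
        return key === current || (previous !== null && key === previous) ? 200 : 401;
      }
      return 401;
    },
  };
}

// Mint before cutover, then probe after it.
const before = makeAuth({ current: "old-key" });
const oldToken = before.mint("old-key").token;
const after = makeAuth({ current: "new-key", previous: "old-key" });
const probes = [
  after.check({ apiKey: "new-key" }),     // new key is active
  after.check({ editorToken: oldToken }), // old token still verifies
  after.check({ apiKey: "old-key" }),     // old key rejected as X-Api-Key
  after.mint("old-key").status,           // old key cannot mint
];
console.log(probes.join(" ")); // 200 200 401 401
```

Dropping `previous` from the model (the equivalent of removing API_KEY_ADMIN_PREVIOUS) makes the old token fail verification as well.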
The editor-key variant of this contract is exercised by scripts/ha/check-6-api-key-rotation.mjs every nightly run; the manual playbook below covers admin-key rotation.
1. Baseline capture. OLD_ADMIN=<current API_KEY_ADMIN>; generate a new random key NEW_ADMIN.

2. Pre-cutover: mint an editor token with the OLD admin key. This is the token we’ll prove still verifies via *_PREVIOUS during the rollover window.

       OLD_TOKEN=$(curl -s -X POST http://localhost:3010/auth/editor-token \
         -H "x-api-key: $OLD_ADMIN" -H "content-type: application/json" \
         -d '{"actor":"rotation-test"}' | jq -r .token)

   PowerShell:

       $resp = Invoke-RestMethod -Method Post -Uri http://localhost:3010/auth/editor-token `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
         -Body '{"actor":"rotation-test"}'
       $env:OLD_TOKEN = $resp.token

3. Pre-cutover sanity: OLD key via X-Api-Key returns 200.

       curl -o /dev/null -s -w "%{http_code}\n" \
         -H "x-api-key: $OLD_ADMIN" http://localhost:3010/templates
       # 200

4. Record cutover timestamp for the restart-window log check:

       T0=$(date +%s)

5. Cutover. Copy docker-compose.ha.override.yml.example to docker-compose.ha.override.yml and uncomment the “API key rotation cutover” section. Set shell env vars NEW_ADMIN and OLD_ADMIN, then:

       docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
         up -d --no-deps api1 api2

   Wait for /health/ready on both replicas to return 200.

6. Log evidence, scoped to the restart window, BEFORE running the deliberate negative probes. This zero-count applies only to background traffic during the restart (schedulers, internal probes, in-flight client calls). The deliberate 401 probes in step 7 come AFTER this check so they don’t get folded in:

       SINCE=$(( $(date +%s) - T0 ))
       # Pass the same compose files used for the cutover so the right project is targeted.
       docker compose -f docker-compose.ha.yml -f docker-compose.ha.override.yml \
         logs api1 api2 --since="${SINCE}s" \
         | grep -Ec '"msg":"Auth failure"|"statusCode":401'
       # Expected: 0. Any non-zero value is a real regression; capture
       # the offending lines.

7. Post-cutover deliberate probes. Four explicit calls that prove the rotation contract:

   | Call | Expected | Why |
   |---|---|---|
   | curl -H "x-api-key: $NEW_ADMIN" /templates | 200 | New key is active. |
   | curl -H "x-editor-token: $OLD_TOKEN" /templates | 200 | Old token verifies via *_PREVIOUS within rollover. |
   | curl -H "x-api-key: $OLD_ADMIN" /templates | 401 | Previous keys cannot be used as X-Api-Key. |
   | curl -X POST -H "x-api-key: $OLD_ADMIN" /auth/editor-token | 401 | Previous keys cannot mint. |

   PowerShell probes (same four, same expected codes):

       $r1 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-api-key' = $env:NEW_ADMIN } -SkipHttpErrorCheck
       $r2 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-editor-token' = $env:OLD_TOKEN } -SkipHttpErrorCheck
       $r3 = Invoke-WebRequest -Method Get -Uri http://localhost:3010/templates `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -SkipHttpErrorCheck
       $r4 = Invoke-WebRequest -Method Post -Uri http://localhost:3010/auth/editor-token `
         -Headers @{ 'x-api-key' = $env:OLD_ADMIN } -ContentType 'application/json' `
         -Body '{"actor":"rotation-test"}' -SkipHttpErrorCheck
       $r1.StatusCode, $r2.StatusCode, $r3.StatusCode, $r4.StatusCode
       # 200 200 401 401

8. Post-rollover window cleanup. Once EDITOR_TOKEN_TTL_MINUTES (default 480 min / 8h) has elapsed since cutover, remove API_KEY_ADMIN_PREVIOUS from the override, restart replicas, and confirm:
   - curl -H "x-editor-token: $OLD_TOKEN" /templates → 401 (old token no longer verifies because the previous key is gone).
   - Minting a fresh editor token with NEW_ADMIN still works.
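The log-evidence count from the playbook can also be done in a script, for example when working from a saved log file rather than a live stack. Below is a minimal sketch that applies the same two patterns as the playbook's grep; `countAuthFailures` is a hypothetical name and the sample lines are made up for illustration, not real product log output.

```javascript
// Hypothetical helper mirroring: grep -Ec '"msg":"Auth failure"|"statusCode":401'
// Like grep -c, it counts matching LINES, so a line matching both patterns counts once.
function countAuthFailures(logText) {
  const pattern = /"msg":"Auth failure"|"statusCode":401/;
  return logText.split("\n").filter((line) => pattern.test(line)).length;
}

// Made-up sample lines, for illustration only.
const sample = [
  'api1  | {"level":30,"msg":"request completed","statusCode":200}',
  'api2  | {"level":40,"msg":"Auth failure","statusCode":401}',
  'api1  | {"level":30,"msg":"request completed","statusCode":204}',
].join("\n");
console.log(countAuthFailures(sample)); // 1
```

As in the playbook, the expected count for the restart window is 0; the sample above deliberately contains one failing line to show what a regression would look like.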
Notes for operators
- The HA stack runs single-tenant by design; the multi-tenant variant (MULTI_TENANT_ENABLED=true) has its own validation surface tracked separately. Check 5 only applies under multi-tenant mode.
- The four nightly-automated jobs each take 25–40 minutes including image build and compose cold-start. They run on demand via workflow_dispatch and on a daily cron; they are intentionally not wired to PR / push events.
- If a nightly run fails, the compose logs are uploaded as a workflow artifact (ha-logs-<run_id>/ha-check-<n>-logs-<run_id>, retention 7 days). Check those before re-running.
- The compose-env fixes from the original 2026-04-21 manual run (asset access mode, hardening opt-out for the demo, S3 path style) are committed and live on main; no per-run patching is required.