# Pulp Engine — Benchmark Pack
This pack is the authoritative reference for Pulp Engine’s render throughput and latency characteristics. It measures what procurement readers actually need to know before adopting the product: how fast a warm pool renders the common page-count buckets, how long a cold start costs, how latency varies across output formats, and how throughput scales with additional API replicas.
Everything here is reproducible. The compose stack, the template pack, and the harness script are all committed to the repo; the numbers in this document are the output of running that exact harness against one documented machine (“the rig”). Re-run it on your own hardware to get numbers that match your deployment.
## Hardware — “the rig”
The published numbers come from a single machine. This is “the rig”, not “typical production”. Numbers on your hardware will differ — sometimes materially. The harness is the artifact that travels; the numbers are a snapshot.
| Item | Value |
|---|---|
| CPU | 13th Gen Intel Core i9-13900KF — 24 cores / 32 threads, 3.0 GHz base |
| RAM | 64 GB |
| Disk | Samsung 980 PRO 1 TB NVMe |
| OS | Windows 11 Pro (build 26200) |
| Docker | 29.3.1, Docker Desktop on WSL2 |
| Pulp Engine image | pulp-engine:bench-head — local build from commit 83f11a2 (post-T3, pre-v0.72.0 tag) |
All containers run on the same host. Postgres, MinIO, nginx, and Prometheus are sibling services on the default bridge network. Latency here reflects Chromium + the template pipeline + local-loopback networking + Docker-Desktop-on-Windows (WSL2) overhead.
Note on the WSL2 overhead. Docker Desktop on Windows routes all container workloads through a Linux VM; Chromium is significantly slower under that layer than on a native Linux host. A run on a native Linux host (bare-metal or an EC2 instance) will typically show materially better PDF latency than the numbers below. For procurement decisions, re-run the harness on hardware that matches your intended deployment.
## What the pack measures
Six cells, covering the main procurement questions:
| Cell name | Configuration | Why it matters |
|---|---|---|
| `pdf-1p` | PDF, 1-page invoice, warm pool, concurrency 5 | The canonical “single small render” number. What most transactional workloads look like. |
| `pdf-10p` | PDF, 10-page report, warm pool, concurrency 5 | The standard business-doc workload. |
| `pdf-100p` | PDF, 100-page quarterly, warm pool, concurrency 5 | Scaling with template weight. Shows where per-page cost dominates. |
| `format-docx` | DOCX, 10-page template | No Chromium in the path. Shows the pure template-pipeline cost. |
| `format-pptx` | PPTX, 10-page template | Slide-layout cost. |
| `format-xlsx` | XLSX, 10-page template | Tabular-only cost. |
Two additional dimensions are covered by re-running the script against a stack brought up with different configuration:
- Render mode: `child-process` (default, warm pool) vs `socket` (privilege-separated controller, cold Chromium per render). Requires a 1-instance bring-up with the socket overlay (`docker-compose.benchmark.socket.yml`) and a `child-process` companion run at identical harness settings — see operator recipe step 6 for the full procedure. `container` mode is excluded this round (the Docker CLI is not in the benchmark API image). Measures per-render isolation overhead.
- HA scaling: 1 / 2 / 4 instances. Bring up only the api services for the current cell with the matching `docker-compose.benchmark.nginx.N.conf`, then re-run `pdf-10p`. Measures throughput-vs-instance scaling.
Cold-start latency is deliberately not an automated cell. The api service eagerly warms its renderer before /health/ready returns OK, so the first sample from a just-ready pod is already a warm render. Measuring genuine cold start requires a manual procedure — see “Cold-start latency” below.
Deliberately out of scope for v1 of this pack: asset-inlining cost, S3-vs-filesystem breakdown at the asset layer, and batch-mode NDJSON-vs-JSON throughput. Tracked in docs/backup-restore-runbook.md and the roadmap.
## Methodology
Warmup. Each cell runs --warmup render requests with results discarded before the measured samples begin. The default is 100; the v1 run on this rig used 30 for the five light cells and 2 for pdf-100p (where a single render is ~5 s).
Samples. Each cell captures `--samples` observations. The default is 500; the v1 run used 100 for the five light cells and 10 for `pdf-100p`. Larger sample counts tighten the p99 estimate; the 2026-04-17 refresh re-ran every cell at 500 samples, and the results section below reflects those numbers.
Concurrency. --concurrency parallel requests against the nginx endpoint. The default is 5.
Measurement. The harness records both the client round-trip wall time and the server-side render duration emitted via the X-Render-Duration-Ms response header (apps/api/src/lib/render-accounting.ts). Percentiles are computed by sorting the per-request durations and picking p50/p95/p99 indices. No PRNGs — dataset generation is deterministic, so re-runs on the same rig yield the same numbers within normal noise.
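The percentile selection described here is small enough to sketch. This is an illustration only, not the harness's actual code from scripts/bench/run.mjs, and the nearest-rank index rule is an assumption:

```javascript
// Minimal sketch of the percentile computation described above: sort the
// per-request durations and pick the p50/p95/p99 indices. The nearest-rank
// rule here is an assumption for illustration, not the harness's code.
function percentile(durationsMs, p) {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  // Nearest-rank index for the p-th percentile of n sorted samples.
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

function summarize(durationsMs) {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}
```

With deterministic inputs the summary is itself deterministic, which is why re-runs on the same rig agree within noise.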
PDF streaming caveat on output size + page count. The default child-process render mode streams single-document PDF responses. On the streaming path, only X-Render-Duration-Ms is set before the stream starts; X-Render-Size-Bytes and X-Render-Pages are not emitted because neither is known until the stream has finished. Expect “—” in the Pages / Output columns for PDF cells on the default rig. Non-PDF formats (DOCX / PPTX / XLSX / CSV / HTML) are buffered, so they always return all three headers. This caveat is a property of the render-accounting split at render-accounting.ts:155, not a harness limitation.
Failures. A failure is any non-2xx response. The summary reports failures as N/total so they’re visible. The published numbers are from runs with zero failures; any non-zero failure count invalidates the cell and the rig has to be diagnosed before re-publishing.
Data scales. The three templates are parameterised by row count:
- `bench-invoice` — 8 line items (≈1 page).
- `bench-report` — 400 rows (≈10 pages).
- `bench-quarterly` — 4000 rows (≈100 pages).
## Results — v1 run (2026-04-16), 500-sample refresh (2026-04-17)
First run on the rig above, single-instance cell (1 API replica), RENDER_MODE=child-process. Rate limits raised (RATE_LIMIT_RENDER_MAX=1000000, RATE_LIMIT_MAX=1000000) so the load limiter doesn’t gate measurement. Zero failures across every cell. All numbers reproducible with pnpm bench:seed then pnpm bench:run against the same rig.
PDF cells show “—” for both Pages and Output because the child-process render path streams single-document PDFs (see “PDF streaming caveat” above); non-PDF formats are buffered and populate all three.
All six cells at 500 samples with zero failures. Harness settings differ by cell weight: the five light cells (pdf-1p, pdf-10p, format-docx, format-pptx, format-xlsx) ran with warmup 100 / concurrency 5; pdf-100p ran with warmup 30 / concurrency 2, since a single 100-page render is ~5 s on this rig and higher concurrency produces pool-contention noise rather than signal. Provenance: bench-results-500-light/ for the five light cells, bench-results-500-heavy/ for pdf-100p.
| Cell | Description | RPS | Client p50 | Client p95 | Client p99 | Server p50 | Server p95 | Server p99 | Pages | Output | Failures |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `pdf-1p` | PDF — 1-page invoice, warm pool | 1.2 | 2104.6 ms | 4144.3 ms | 4150.7 ms | 1.0 ms | 3.0 ms | 4.0 ms | — | — | 0/500 |
| `pdf-10p` | PDF — 10-page report, warm pool | 1.1 | 2238.5 ms | 4413.7 ms | 4432.4 ms | 4.0 ms | 6.0 ms | 8.0 ms | — | — | 0/500 |
| `pdf-100p` | PDF — 100-page quarterly, warm pool | 0.4 | 4558.8 ms | 4891.5 ms | 5002.1 ms | 48.0 ms | 77.0 ms | 83.0 ms | — | — | 0/500 |
| `format-docx` | DOCX — 10-page template | 10.1 | 489.2 ms | 543.8 ms | 553.4 ms | 309.0 ms | 502.0 ms | 532.0 ms | — | 18.4 KB | 0/500 |
| `format-pptx` | PPTX — 10-page template | 22.3 | 212.7 ms | 239.4 ms | 247.9 ms | 135.0 ms | 188.0 ms | 196.0 ms | — | 2.02 MB | 0/500 |
| `format-xlsx` | XLSX — 10-page template | 59.5 | 75.4 ms | 91.5 ms | 109.3 ms | 33.0 ms | 65.0 ms | 75.0 ms | — | 17.5 KB | 0/500 |
Reading the numbers:
- Chromium fixed overhead dominates single PDF renders. `pdf-1p` and `pdf-10p` client p50 are within 7% of each other (2.1 s vs 2.2 s) — the marginal cost of 9 extra pages is near zero on a warm pool. The per-page cost only becomes visible at `pdf-100p` (+2.3 s for the jump from 10 to 100 pages, so roughly 26 ms/page). For transactional 1–10 page workloads the binding constraint is page-pool throughput, not per-page layout cost.
- Server-side PDF duration is handoff-only. The `Server p50/p95/p99` columns for PDF cells are the time from request-in to stream-start, which is < 10 ms for the light cells and 77 ms for the 100-page cell. The total render wall time lives in the client columns. This is the streaming-accounting split documented in `render-accounting.ts`; it is not a harness artifact.
- Non-PDF formats reveal the pure template pipeline. XLSX is the cleanest signal on this rig — server p50 33 ms with 17.5 KB output. DOCX server p95 is 502 ms vs server p50 309 ms (a 1.6× multiplier at 500 samples, well within normal format-pipeline variance). The earlier 100-sample run reported a much larger 5.6× multiplier (server p95 1628 ms vs server p50 292 ms); re-measuring at 5× the sample count flattened that apparent tail, so the original signal was 100-sample noise rather than a real long-tail pathology.
- PDF throughput is low on this rig. ~1.1–1.2 RPS for 1-page PDF on a single `child-process` replica. Most of this is WSL2 + Docker Desktop overhead on Windows — a native Linux rig will typically show several times this throughput. Use these numbers as a floor for that hardware class, not a prediction for a production deployment.
## Saturation model — PDF
The main matrix’s pdf-1p result (1.2 RPS, p50 2104.6 ms, p95 4144.3 ms at concurrency 5) looks at first glance like the pool is running below theoretical ceiling. It is not — it’s saturating exactly, and the arithmetic explains every number.
The harness runs strict batches. scripts/bench/run.mjs fires concurrency requests at once and waits for Promise.all before starting the next batch. So each 5-request batch takes:
```
batch_wall_time = ceil(concurrency / slots) × p50
RPS = concurrency / batch_wall_time
```
Default config (4 effective slots): MAX_CONCURRENT_PAGES=5 with RENDER_PREVIEW_RESERVED_SLOTS=1 — the /render path sees a 4-slot batch lane (browser-pool.ts:89). At concurrency 5, four samples land in slots immediately and the fifth waits one full wall time for a slot to free, so batch wall time ≈ 2 × p50 ≈ 4.2 s. Predicted RPS = 5 / 4.2 ≈ 1.19. Predicted client p95 ≈ 2 × p50 because 1 sample in every 5-batch is the slow one — a 4/1 fast/slow split that puts p95 squarely in the “waited one wall time” bucket. Observed: 1.2 RPS, p95 4144 ms ≈ 2 × 2105 ms p50. Matches to within ±1%.
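The wave arithmetic above can be checked numerically. A minimal sketch, using the slot counts and the measured `pdf-1p` p50 from this section (not the harness code):

```javascript
// Numeric check of the batch/wave arithmetic above. Inputs come from the
// text: 4 effective slots by default, 5 with RENDER_PREVIEW_RESERVED_SLOTS=0,
// pdf-1p p50 = 2104.6 ms. A sketch, not the harness.
function predictBatch({ concurrency, slots, p50Ms }) {
  const waves = Math.ceil(concurrency / slots); // full wall times per batch
  const batchWallMs = waves * p50Ms;
  return {
    rps: concurrency / (batchWallMs / 1000),
    // When a batch needs a second wave, the slow samples put p95 at ~2x p50.
    p95Ms: waves > 1 ? 2 * p50Ms : p50Ms,
  };
}

const dflt = predictBatch({ concurrency: 5, slots: 4, p50Ms: 2104.6 });
// dflt.rps ≈ 1.19, dflt.p95Ms ≈ 4209 ms: the published default-config row.

const open = predictBatch({ concurrency: 5, slots: 5, p50Ms: 2104.6 });
// open.rps ≈ 2.38: the RENDER_PREVIEW_RESERVED_SLOTS=0 experiment's shape.
```

The 4-slot prediction reproduces the observed 1.2 RPS / p95 ≈ 2 × p50; the 5-slot variant predicts the ~2.3 RPS measured in the SLOTS=0 experiment.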
Experimental confirmation at SLOTS=0. Setting RENDER_PREVIEW_RESERVED_SLOTS=0 collapses the preview lane into the batch lane, so /render sees 5 slots. All 5 requests in a batch run simultaneously; batch wall time ≈ 1 × p50; predicted RPS = 5 / p50 ≈ 2.4 and predicted client p95 collapses toward p50. Measured on this rig (1-instance, pdf-1p, warmup 100, 500 samples, 0/500 failures):
| Config | RPS | Client p50 | Client p95 | Shape |
|---|---|---|---|---|
| `RENDER_PREVIEW_RESERVED_SLOTS=1` (default) | 1.2 | 2104.6 ms | 4144.3 ms | bimodal (4 fast + 1 slow per batch) |
| `RENDER_PREVIEW_RESERVED_SLOTS=0` | 2.3 | 2177.0 ms | 2202.2 ms | unimodal (all 5 fast per batch) |
RPS roughly doubles; client p95 collapses to within 1% of p50. This is an experimental data point, not a recommendation to run production with RENDER_PREVIEW_RESERVED_SLOTS=0 — that setting trades interactive preview responsiveness (the reserved slot exists so a burst of batch renders can’t starve the live preview surface) for batch throughput. Operators choosing between them should treat this as documentation of the knob’s effect, not a tuning default. See the rollback note in release-v0.65.0.md:60.
Serial baseline sanity check. Run at concurrency 1 (no queueing possible), pdf-1p p50 was 2157.6 ms with p95 = p50 + 8 ms — confirms the per-render wall time against which the saturation arithmetic is anchored. This wall time is where most of the WSL2+Docker overhead lives; a native-Linux rig would push it materially lower.
## Render-mode comparison (v0.73.0 framed protocol, 2026-04-18)
pdf-10p at identical harness settings across modes, on a 1-instance rig (so the comparison measures render-mode cost, not queueing). HA scaling is a separate axis — see the HA table below. Both rows produced in the same session; images are the v0.73.0 framed-protocol builds (pulp-engine:bench-head, pulp-engine-worker:bench-head, pulp-engine-controller:bench-head) rebuilt from the current tree.
Harness: pdf-10p, warmup 5, samples 30, concurrency 2. Sample count is intentionally light because socket mode pays a cold-Chromium-per-render tax (no warm pool — each render spawns a fresh worker container) and 30 samples at ~4 s each is already ~2 minutes of wall time per mode.
| Mode | RPS | Client p50 | Client p95 | Client p99 | Server p50 (TTFB) | Server p95 | Server p99 | Failures |
|---|---|---|---|---|---|---|---|---|
| `child-process` | 0.9 | 2195 ms | 2210 ms | 2213 ms | 5 ms | 7 ms | 10 ms | 0/30 |
| `socket` | 0.5 | 3790 ms | 3950 ms | 3969 ms | 5 ms | 8 ms | 9 ms | 0/30 |
Reading the numbers:
- Socket is ~1.7× slower than child-process on client round-trip (3790 vs 2195 ms p50). The delta is Chromium cold-start: socket mode spawns a fresh worker container per render, so every sample pays the full Puppeteer bootstrap cost. Child-process mode reuses a warm Puppeteer pool across renders.
- Server p50 is essentially identical at ~5 ms in both modes. Both paths stream PDF bytes end-to-end and write `X-Render-Duration-Ms` at byte-handoff time, so the server number is time-to-first-byte for both. Zero divergence there — this is what v0.73.0’s framed protocol was built for.
- Zero failures across 60 samples. The pre-v0.73 JSON-with-base64 envelope historically failed on this rig because Docker Desktop for Windows caps `docker run -i` stdout at 65 536 bytes (see “Why the previous run was parked” below). The v0.73.0 framed binary protocol — per-frame ≤ 16 MiB, active-reader-per-frame — runs cleanly on the same rig. This confirms the cap was per-write / pipe-buffer rather than cumulative, and the framed protocol sidesteps it.
- Throughput at concurrency 2 on 1 instance is bounded by single-render wall time, not the browser pool. `RPS ≈ concurrency / p50` arithmetic: child-process 2/2.195 ≈ 0.9, socket 2/3.790 ≈ 0.5. Both rows match prediction within variance.
`container` render mode remains excluded: the current benchmark API image does not ship the Docker CLI, and `container-render-dispatcher.ts` shells out to `docker` from inside the API process. Adding the CLI is a separate image-variant initiative. `socket` mode provides the same isolation story via a privilege-separated controller process that already has the Docker CLI.
Harness note — undici fetch + dispatcher from the same package. The harness imports both fetch and Agent from undici so the dispatcher and the fetch implementation come from the same package version. Node 24 ships undici 7 internally; the workspace pins undici@^8. Passing an undici@8 Agent as dispatcher: to node’s global fetch triggers UND_ERR_INVALID_ARG: invalid onRequestStart method because the two versions disagree on the interceptor ABI. Fixed in run.mjs:59 — the explicit per-origin connection cap (load-bearing for high-concurrency cells) is preserved. The 2026-04-18 rerun measured above ran with the fixed harness.
### Why the previous run was parked
The earlier F1b session in this spot could not produce socket-mode numbers on this Windows+WSL2 rig because Docker Desktop for Windows hard-capped docker run -i stdout at exactly 65 536 bytes in the pre-v0.73 JSON-with-base64 worker→controller envelope. Every PDF above ~64 KiB was truncated mid-base64, the controller returned engine_crash: Container produced unparseable output: ..., and the client saw HTTP 422. The v0.73.0 framed binary protocol retired that envelope — each CHUNK frame is ≤ 16 MiB with its own length prefix, and the controller drains the pipe per frame rather than waiting for EOF. The 2026-04-18 rerun above confirms the fix works on this exact rig.
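The per-frame drain can be illustrated with a toy length-prefixed framer. The wire layout below (a 4-byte big-endian length prefix per chunk) is an assumption for illustration; the actual v0.73.0 frame format is defined by the engine, not by this sketch:

```javascript
// Toy length-prefixed chunk framing in the spirit of the v0.73.0 protocol
// described above. The 4-byte big-endian length prefix and the 16 MiB cap
// shown here are ASSUMPTIONS for illustration, not the engine's format.
const MAX_FRAME = 16 * 1024 * 1024;

function encodeFrames(payload, frameSize = MAX_FRAME) {
  const frames = [];
  for (let off = 0; off < payload.length; off += frameSize) {
    const chunk = payload.subarray(off, off + frameSize);
    const header = Buffer.alloc(4);
    header.writeUInt32BE(chunk.length, 0); // each frame carries its own length
    frames.push(Buffer.concat([header, chunk]));
  }
  return Buffer.concat(frames);
}

function decodeFrames(stream) {
  const chunks = [];
  let off = 0;
  while (off < stream.length) {
    const len = stream.readUInt32BE(off); // drain per frame, never wait for EOF
    chunks.push(stream.subarray(off + 4, off + 4 + len));
    off += 4 + len;
  }
  return Buffer.concat(chunks);
}
```

The point of the design is visible in the reader: it consumes one complete frame at a time, so no single write ever has to exceed a pipe-buffer-sized chunk.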
RENDER_MODE=socket is the docs’ recommended containerized-isolation path (deployment-guide.md:596) — appropriate when the extra privilege separation between the API process and the Docker daemon is specifically required. The HA reference still points at child-process as the default starting point (ha-reference-architecture.md:102) unless socket’s privilege separation is load-bearing for the deployment’s threat model; the ~1.7× latency hit measured above is the trade-off operators are paying for that property.
The F1b session that produced the original parked row also landed two incidental bug fixes driven by the attempt to run the comparison:
- `3ad232a` — `RENDER_PREVIEW_RESERVED_SLOTS` was documented as a rollback knob but silently non-functional in `RENDER_MODE=child-process` because the worker fork inherited only three explicit env vars. Now forwarded correctly.
- `5812f40` — the `render-controller.ts` entry-point guard used strict-equality path comparison, which failed on pnpm-deployed bundles (the production `compose.container.yaml` deploy layout) because Node’s ESM loader resolves `import.meta.url` through pnpm’s symlink while `process.argv[1]` stays at the symlinked path. The controller exited with code 0 and no logs, never starting the server. Now normalised with `realpathSync`.
Follow-ups. F4 (native Linux rig) remains the right path to eliminate the WSL2 variable and produce numbers from a rig class closer to typical production deployments. Container-mode numbers are still pending the CLI-in-image work.
## HA scaling (2026-04-16)
pdf-10p at concurrency 5, 100 samples, 30 warmup. All three rows produced in the same session so the scaling curve is free from rig-consistency drift between runs. Sample-count note: these HA rows are at 100 samples from the F1a run; the main matrix above is at 500 samples from the F2 re-run. The 1-instance HA row and the main-matrix pdf-10p row measure the same topology at different sample counts — they should agree within variance but are not produced by the same invocation.
| Instances | Nginx config | RPS | Client p50 | Client p95 | Client p99 | Server p50 | Server p95 | Failures |
|---|---|---|---|---|---|---|---|---|
| 1 | docker-compose.benchmark.nginx.1.conf | 1.2 | 2237.2 ms | 4394.2 ms | 4414.0 ms | 4.0 ms | 6.0 ms | 0/100 |
| 2 | docker-compose.benchmark.nginx.2.conf | 2.4 | 2213.5 ms | 2256.8 ms | 2294.8 ms | 5.0 ms | 9.0 ms | 0/100 |
| 4 | docker-compose.benchmark.nginx.4.conf | 2.3 | 2216.3 ms | 2242.9 ms | 2282.9 ms | 6.0 ms | 12.0 ms | 0/100 |
Reading the curve:
- 1 → 2 instances is linear. RPS doubles (1.2 → 2.4) and client p95 roughly halves (4394 → 2257 ms). The 1-instance p95 was dominated by queue wait for one of five concurrent requests to get a render slot; adding a second replica absorbs the queue.
- 2 → 4 instances plateaus. At client concurrency 5, only five requests can be in flight at once; two replicas’ worth of page-pool capacity is already enough to serve them without queueing. Adding a third and fourth replica has nothing to do while the client keeps concurrency flat. The ~0.05 RPS drop from 2 → 4 is noise — 4-instance p50 (2216 ms) is within a single sample of 2-instance (2213 ms).
- To saturate more replicas you need more client concurrency. A production workload driving concurrency 10 or 20 would show the 4-instance cell materially outperforming the 2-instance cell. The dedicated higher-concurrency follow-up that sweeps `--concurrency` from 5 up to 32 at the 4-instance cell is documented in the “Higher-concurrency HA sweep (4-instance)” subsection below and confirms the 2→4 plateau at concurrency 5 was client-side-concurrency-bound, not pool-saturation-bound.
- Sanity check against the published 1-instance baseline (1.1 RPS / 4842 ms p95 on the 2026-04-16 main-matrix run): the fresh 1-instance HA row here (1.2 RPS / 4394 ms p95) is within normal variance. No surprises.
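The plateau reading above is the same wave arithmetic as the saturation model, applied per replica count. A sketch, assuming 4 batch slots per instance (as documented in the saturation-model section) and the queue-free 2-instance p50 as the per-render wall time:

```javascript
// Wave arithmetic per replica count: 4 batch slots per instance (from the
// saturation-model section), client concurrency 5, per-render wall time
// anchored at the 2-instance p50 (2213.5 ms). A sketch, not the harness.
function predictHa({ instances, slotsPerInstance = 4, concurrency = 5, p50Ms = 2213.5 }) {
  const slots = instances * slotsPerInstance;
  const waves = Math.ceil(concurrency / slots); // wall times per 5-request batch
  return { waves, rps: concurrency / ((waves * p50Ms) / 1000) };
}

const one = predictHa({ instances: 1 });  // 2 waves -> ~1.13 RPS (observed 1.2)
const two = predictHa({ instances: 2 });  // 1 wave  -> ~2.26 RPS (observed 2.4)
const four = predictHa({ instances: 4 }); // still 1 wave -> same RPS (observed 2.3)
```

With concurrency fixed at 5, the model drops from two waves to one between 1 and 2 instances and then stays at one wave, which is exactly the linear-then-flat curve in the table.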
### Higher-concurrency HA sweep (4-instance, 2026-04-17)
pdf-10p at 4-instance topology (docker-compose.benchmark.nginx.4.conf, api1..api4), warmup 30, samples 160, five cells varying only --concurrency. Sample count is the LCM of the five concurrency values so every cell runs tail-batch-free (160 / C is always an integer). Zero failures across all 800 samples.
The “naive-ceiling RPS” column is the F3 saturation-model prediction assuming per-render wall time stays at the C=5 p50 (2.34 s) regardless of concurrent load: with T = 16 total batch slots (4 instances × 4 slots), RPS_ceiling = C / (ceil(C / 16) × 2.34). The gap between that ceiling and the measured RPS is the key finding of this session.
| Concurrency | Measured RPS | Naive-ceiling RPS | Client p50 | Client p95 | Client p99 | Failures |
|---|---|---|---|---|---|---|
| 5 | 2.1 | 2.14 | 2343.5 ms | 2392.8 ms | 2413.8 ms | 0/160 |
| 10 | 4.1 | 4.27 | 2408.8 ms | 2462.6 ms | 2488.7 ms | 0/160 |
| 16 | 4.3 | 6.84 | 2517.2 ms | 2601.8 ms | 4775.8 ms | 0/160 |
| 20 | 4.1 | 4.27 | 2542.6 ms | 4815.6 ms | 4879.9 ms | 0/160 |
| 32 | 4.7 | 6.84 | 4698.9 ms | 5047.5 ms | 7232.0 ms | 0/160 |
Reading the curve:
- C=5 cleanly matches the ceiling (measured 2.1 / predicted 2.14 RPS, within 2%). At 5 in-flight requests across 16 total slots the pool is deeply under-subscribed, every sample renders serially in its own slot, and per-render wall time is the same as F3’s serial baseline. This also reconfirms the F1a 4-instance HA row (2.3 RPS at concurrency 5) within normal variance, and resolves the F1a “2→4 plateau” reading: the plateau was a client-side-concurrency artefact, not pool saturation.
- C=10 is still ceiling-bound (measured 4.1 / predicted 4.27 RPS, within 4%). Per-render wall time stays at ~2.4 s — 10 parallel Chromiums on a 24-core host barely contend.
- C=16 plateaus at ~4.3 RPS, well below the ceiling’s 6.84 (measured is 63% of naive prediction). Per-render p50 and p95 stay tight (2517 / 2602 ms — no bimodal tail), so the gap is NOT queueing: it is host CPU contention. With 16 concurrent Chromium page renders across 4 API processes, the i9-13900KF’s 8 P-cores + 16 E-cores on WSL2+Docker Desktop cannot sustain all 16 at single-render speed; each render stretches. The `RPS = C / p50` arithmetic still holds if you use the contention-stretched p50, but the F3 model’s “p50 is constant” assumption breaks down here.
- C=20 shows the classic queueing signature on top of saturation (measured 4.1 / predicted 4.27 RPS). RPS matches prediction because the arithmetic model does not assume more per-render slowdown between C=16 and C=20 (same 16-slot ceiling). Client p95 almost exactly doubles to 4816 ms — the 4 over-the-16 samples wait one full wall time on the second wave. Bimodal as predicted.
- C=32 stays on the plateau at ~4.7 RPS, slightly above C=20 because each batch now fully fills two waves rather than leaving 12 slots idle on the second pass. Client p50 (4699 ms) is almost exactly 2 × the C=5 p50 — consistent with half the samples running immediately and half queuing for a second wave (the median sample is in the queued half). Second peak of the F3 sawtooth, modulated by contention.
Actual 4-instance peak throughput on this rig: ~4.5 RPS at any concurrency ≥ 16, plateau regardless of how much higher C goes. That is ~2× the F1a 2-instance row at concurrency 5 (2.4 RPS), so 4 instances do deliver meaningfully more throughput than 2 instances under saturation — just not the 2× bump the naive arithmetic would predict. The binding constraint above C=16 on this hardware class is host CPU, not API pool or browser pool.
Caveat on the ceiling comparison. The naive-ceiling column uses the C=5 p50 (2.34 s) as a contention-free reference. If the comparison used C=16’s own p50 (2.52 s) instead, the “ceiling” at C=16 drops to 6.35 RPS — still 50%+ above the 4.3 measurement. The CPU-contention effect shows up as both a reduced effective slot count AND per-slot slowdown, which the model cannot cleanly separate without a deeper rig-level CPU trace. For published-pack purposes, the C=5-anchored ceiling is the simpler upper bound, and the gap (~4.3 vs 6.8 at C=16) is the actionable data point for operators: on this host class, max 4-instance throughput on pdf-10p is ~4.5 RPS, not the pool-arithmetic 7.1.
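Both ceiling anchors in this subsection are reproducible from the formula above. A sketch of the arithmetic only:

```javascript
// Reproduces the naive-ceiling column: T = 16 total batch slots
// (4 instances x 4 slots), per-render wall time anchored at the C=5 p50
// (2.34 s). Arithmetic from the text; not the harness.
const SLOTS = 16;

function ceilingRps(concurrency, p50Sec) {
  return concurrency / (Math.ceil(concurrency / SLOTS) * p50Sec);
}

// C=5-anchored column, as published in the table above:
const anchored = [5, 10, 16, 20, 32].map((c) => ceilingRps(c, 2.34));
// -> ~[2.14, 4.27, 6.84, 4.27, 6.84]

// C=16-anchored variant from the caveat: still 50%+ above the 4.3 measured.
const contended = ceilingRps(16, 2.52); // -> ~6.35
```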
## Cold-start latency
The automated cells all measure warm-pool latency because the api service calls warmBrowser() before /health/ready returns OK (server.ts). To measure a genuine cold start including Chromium bootstrap:
```shell
# Restart one api container between samples and time the first render.
docker compose -f docker-compose.benchmark.yml restart api1

# Wait for /health/ready via the bb sidecar.
while ! docker compose -f docker-compose.benchmark.yml exec -T bb \
  wget -q --spider http://api1:3000/health/ready; do sleep 0.2; done

# Issue a single render request and capture the server-reported duration.
curl -s -H "X-Api-Key: $PULP_ENGINE_RENDER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"template":"bench-invoice","data":{"invoiceNumber":"cold","customerName":"cold","items":[{"description":"line","quantity":1,"unitPrice":10}]}}' \
  -D - -o /dev/null \
  http://localhost:3000/render | grep -i '^x-render-duration-ms:'
```
The reported number for a fresh docker compose restart is typically much higher than the pdf-1p warm p50 because Chromium has to load. Repeat five times and publish the median. This procedure is manual because scripting it reliably across platforms (WSL, macOS, Linux hosts with different wget / curl semantics) is more error-prone than the procedure itself is useful.
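For the “repeat five times and publish the median” step, a tiny helper suffices once the five X-Render-Duration-Ms values are collected by hand. The sample values in the comment are purely illustrative, not measurements:

```javascript
// Median of an odd- or even-length list of hand-collected samples.
// The example values are illustrative only, not measured cold-start numbers.
function median(samples) {
  const s = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// median([5210, 4980, 5525, 5102, 5340]) -> 5210
```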
## Reproducing this on your hardware
The harness does not depend on “the rig” — it runs anywhere Docker runs. Follow the steps below to get your own numbers.
1. Set benchmark env vars
Create .env.benchmark at the repo root with matching API keys and a token secret:
```shell
API_KEY_ADMIN=dk_admin_bench_$(openssl rand -hex 8)
API_KEY_RENDER=dk_render_bench_$(openssl rand -hex 8)
EDITOR_TOKEN_SECRET=$(openssl rand -hex 32)
PULP_ENGINE_IMAGE=ghcr.io/troycoderboy/pulp-engine:v0.72.0   # pin to the version under test
PULP_ENGINE_BENCH_NGINX_CONF=./docker-compose.benchmark.nginx.4.conf
```
2. Bring up the rig (4-instance cell)
```shell
# Infrastructure first.
docker compose -f docker-compose.benchmark.yml --env-file .env.benchmark \
  up -d postgres minio minio-setup prometheus bb

# Wait for minio-setup to exit cleanly.
docker compose -f docker-compose.benchmark.yml ps minio-setup   # STATUS: exited (0)

# Then the api replicas (migrate runs once, auto).
docker compose -f docker-compose.benchmark.yml --env-file .env.benchmark \
  up -d api1 api2 api3 api4

# Wait for readiness via the bb sidecar (runtime image has no curl — we
# use the sidecar's wget).
for i in 1 2 3 4; do
  while ! docker compose -f docker-compose.benchmark.yml exec -T bb \
    wget -q --spider "http://api${i}:3000/health/ready"; do sleep 1; done
  echo "api${i}: ready"
done

# Finally nginx.
docker compose -f docker-compose.benchmark.yml --env-file .env.benchmark \
  up -d nginx

# Confirm Prometheus sees every api.
curl -s 'http://localhost:9090/api/v1/query?query=count(up==1)' \
  | jq '.data.result[0].value[1] | tonumber'   # expect: 4
```
3. Seed the templates
```shell
PULP_ENGINE_URL=http://localhost:3000 \
PULP_ENGINE_ADMIN_KEY="$API_KEY_ADMIN" \
node scripts/bench/seed-templates.mjs
```
The seeder is idempotent: on re-run it skips any template key that already exists. If you edit a bench template JSON and want the change picked up, delete the old template first, then re-run the seeder. The delete route uses optimistic concurrency via If-Match, so it’s a two-step flow — fetch the current version, then delete with that version as the If-Match value:
```shell
ETAG=$(curl -sI -H "X-Api-Key: $API_KEY_ADMIN" \
  "$PULP_ENGINE_URL/templates/bench-invoice" \
  | awk -F'"' '/^etag:/ { print $2 }')
curl -X DELETE \
  -H "X-Api-Key: $API_KEY_ADMIN" \
  -H "If-Match: \"$ETAG\"" \
  "$PULP_ENGINE_URL/templates/bench-invoice"
```
The GET response’s ETag header quotes the current template version (apps/api/src/routes/templates/index.ts). Omitting If-Match on the DELETE returns 428 Precondition Required; passing a stale value returns 412 Precondition Failed.
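The same two-step flow can be driven from Node instead of curl. A hedged sketch: route shapes and header names are taken from this document, `deleteTemplate` is a hypothetical helper, and executing the network calls requires a running stack:

```javascript
// Sketch of the two-step optimistic-concurrency delete described above.
// Endpoint paths and header names come from this document; deleteTemplate
// is a hypothetical helper and needs a live stack to actually run.
function ifMatchValue(etag) {
  // The ETag header quotes the current version; If-Match must carry the
  // same quoted form. Accept either a bare or an already-quoted input.
  const bare = etag.replace(/^"+|"+$/g, "");
  return `"${bare}"`;
}

async function deleteTemplate(baseUrl, adminKey, templateKey) {
  const headers = { "X-Api-Key": adminKey };
  // Step 1: fetch the current version from the ETag header.
  const res = await fetch(`${baseUrl}/templates/${templateKey}`, { headers });
  const etag = res.headers.get("etag");
  // Step 2: delete with that version as the If-Match value.
  return fetch(`${baseUrl}/templates/${templateKey}`, {
    method: "DELETE",
    headers: { ...headers, "If-Match": ifMatchValue(etag) },
  });
}
```

As the text notes, omitting If-Match yields 428 and a stale value yields 412, so a failed delete here is a signal to re-fetch the ETag, not to retry blindly.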
4. Run the harness
```shell
PULP_ENGINE_URL=http://localhost:3000 \
PULP_ENGINE_RENDER_KEY="$API_KEY_RENDER" \
node scripts/bench/run.mjs --out=./bench-results
```
Produces ./bench-results/bench-results.csv (raw per-request rows) and ./bench-results/bench-results.md (percentile summary). Copy the summary into this doc, alongside your hardware description.
5. Run the HA-scaling cells
Tear down (docker compose -f docker-compose.benchmark.yml down -v), swap PULP_ENGINE_BENCH_NGINX_CONF in .env.benchmark to the 1-, 2-, or 4-instance config, bring up only the api services required for that cell (docker compose up -d api1 for 1-instance, api1 api2 for 2-instance, api1 api2 api3 api4 for 4-instance), wait for each api’s /health/ready via the bb sidecar, bring up nginx, and re-run --cells=pdf-10p. Verify count(up==1) in Prometheus matches the expected replica count before trusting the results.
6. Run the render-mode cell (socket only, this round)
Render-mode comparison is a 1-instance experiment — so the comparison measures render-mode cost, not queueing. container mode is deliberately out of this round (the current benchmark API image does not ship the Docker CLI; container-render-dispatcher.ts shells out to docker from inside the API process). socket mode provides the isolation story via a privilege-separated controller process — the docs’ recommended containerized-isolation path (deployment-guide.md:596), not an unqualified production default; child-process remains the default starting point per ha-reference-architecture.md:102 unless the extra privilege separation is specifically required.
On Windows Docker Desktop rigs this recipe was previously blocked by a 64-KiB cap on docker run -i stdout that truncated every PDF above that size in the pre-v0.73 JSON-with-base64 worker↔controller protocol. v0.73.0 replaced that envelope with framed binary stdio whose reader drains the pipe per frame, and the 2026-04-18 rerun confirmed the fix on this WSL2 rig (see the “Render-mode comparison” block above for the full finding). The recipe works on Linux rigs as well; F4 (native Linux run) remains the cleanest path to produce the numbers without the WSL2 variable.
Prerequisites (both are local image builds; first build can take 10–15 min each):
```shell
docker build -f Dockerfile.worker -t pulp-engine-worker:bench-head .
docker build -f Dockerfile.controller -t pulp-engine-controller:bench-head .
```
Socket-mode bring-up uses an overlay compose file that adds a render-controller service and a tmpfs-backed socket volume on top of the base benchmark stack. Always use the 1-instance nginx config and only bring up api1 — a render-mode row produced against a 4-instance topology would conflate render-mode cost with queueing regime.
```shell
docker compose -f docker-compose.benchmark.yml down -v
# Set PULP_ENGINE_BENCH_NGINX_CONF=./docker-compose.benchmark.nginx.1.conf in .env.benchmark
docker compose -f docker-compose.benchmark.yml -f docker-compose.benchmark.socket.yml \
  --env-file .env.benchmark up -d postgres minio minio-setup prometheus bb render-controller

# Wait for render-controller healthy.
docker compose -f docker-compose.benchmark.yml -f docker-compose.benchmark.socket.yml \
  --env-file .env.benchmark up -d api1

# Wait for bb-sidecar /health/ready on api1.
docker compose -f docker-compose.benchmark.yml -f docker-compose.benchmark.socket.yml \
  --env-file .env.benchmark up -d nginx

# Run the socket benchmark at low samples/concurrency because each render is cold Chromium.
pnpm bench:run --cells=pdf-10p --warmup=5 --samples=30 --concurrency=2 \
  --out=./bench-results-socket
```
Then run a child-process companion at the same settings (warmup 5 / samples 30 / concurrency 2) so the published comparison table is apples-to-apples. Tear down, drop the socket overlay, bring up api1 again under the default docker-compose.benchmark.yml, re-run pnpm bench:run --cells=pdf-10p --warmup=5 --samples=30 --concurrency=2 --out=./bench-results-socket-companion. The comparison table then pairs the two rows produced at identical settings; the procurement-facing child-process row at the richer settings stays in the main matrix.
docker-compose.benchmark.socket.yml is committed at the repo root, and the paths referenced by this recipe resolve today. The first clean socket-mode run landed on 2026-04-18 (see the “Render-mode comparison” section above); this section remains the operator recipe for reproducing it.
7. Clean up
```shell
docker compose -f docker-compose.benchmark.yml down -v
```
The -v also removes the postgres-bench, minio-bench, and prometheus-bench volumes.
## Notes on interpretation
Server vs client percentiles. The client p95 includes network I/O, HMAC verification, response buffering, and nginx proxying; the server p95 is just the render pipeline. On loopback the gap is typically single-digit percent, so a gap above ~30% usually points at the network layer. On real networks expect the client-side number to drift further from the server-side number.
Page-count scaling is not linear. Chromium has a roughly constant overhead per page (layout + paint) plus a fixed bootstrap. Don’t extrapolate from pdf-10p to estimate pdf-1000p — measure it directly with a corresponding template.
Concurrency ceiling. BATCH_CONCURRENCY (default 5) caps the per-instance page pool. At higher sustained load the bottleneck becomes Chromium page-context contention; the HA-scaling cell shows the practical throughput ceiling before adding pods.
HARDEN_PRODUCTION=false. The benchmark rig runs with HARDEN_PRODUCTION=false so the startup posture check doesn’t gate the stack. Production deployments must not disable the check; see docs/deployment-guide.md.
## Versioning this pack
The numbers above are pinned to the Pulp Engine image named in the “Hardware” section. When a new version of the product lands, re-run the harness against the new image on the same rig and replace the tables. Keep historic runs as docs/benchmark-pack-vX.Y.Z.md if you want a trend record — the main document always reflects the latest release.