bug(ocr): /tmp tmpfs too small for Surya model download — guided OCR fails with ENOSPC on staging #614

New Issue

marcel · 2026-05-18T10:38:57+02:00

marcel commented

2026-05-18 10:38:57 +02:00

Context

After the OCR security hardening landed (commits 1aca4c4a, ab24786d, fc8b4b16, f1e0b92f), the OCR container in staging (archiv-staging-ocr-service-1) cannot complete a guided OCR request.

Two distinct failures have surfaced. The first — PermissionError: '/app/cache/datalab' — was fixed today by chowning the pre-existing archiv-staging_ocr-cache and archiv-staging_ocr-models Docker volumes from 0:0 → 1000:1000 (the volumes pre-dated the non-root ocr user introduced in 1aca4c4a).

This issue tracks the second failure, which only became visible after the chown unblocked the first.

Symptom

First guided OCR call after a clean restart triggers Surya to download text_recognition/2025_09_23/model.safetensors (1.34 GB). The download fails at ~510 MB, retried 3× by Surya, each time with:

OSError: [Errno 28] No space left on device
  File "/usr/local/lib/python3.11/site-packages/surya/common/s3.py", line 55, in download_file
    f.write(chunk)

The host (raddatz.cloud, /dev/md2) has 1.8 TB free. /app/cache and /app/models are both on that disk. The space exhausted is somewhere else.

Root cause

surya/common/s3.py:95 — download_directory stages every file into a tempfile.TemporaryDirectory() and shutil.moves to the cache afterwards:

with tempfile.TemporaryDirectory() as temp_dir:
    ...
    for file in manifest["files"]:
        local_file = os.path.join(temp_dir, file)
        futures.append(executor.submit(download_file, remote_file, local_file))
    ...
    # Move all files to new directory
    for file in os.listdir(temp_dir):
        shutil.move(os.path.join(temp_dir, file), local_dir)

tempfile.TemporaryDirectory() honours $TMPDIR and otherwise falls back to /tmp. The OCR container declares:

read_only: true
tmpfs:
  - /tmp:size=512m   # training endpoints write ZIPs to /tmp; 512 MB covers typical batches (20–50 images)

TMPDIR is unset → staging path resolves to the 512 MB /tmp tmpfs → 1.34 GB safetensors blows the budget at ~510 MB. The comment on that line was written before Surya was introduced; nobody re-evaluated /tmp sizing for model downloads.

Inside the running container:

tmpfs           512M  8.0K  512M   1% /tmp
/dev/md2        2.0T   44G  1.8T   3% /app/cache
/dev/md2        2.0T   44G  1.8T   3% /app/models

Approaches under consideration

Pick exactly one. Trade-offs are real — please choose deliberately.

Approach A — Redirect `TMPDIR` to the disk-backed cache volume (recommended)

Add a new TMPDIR=/app/cache/.tmp env var to the ocr-service block, and ensure the directory exists with ocr:ocr ownership. The /tmp tmpfs stays at 512 MB and continues to absorb training ZIPs (which are small and benefit from being in RAM).

ocr-service:
  environment:
    HF_HOME: /app/cache
    XDG_CACHE_HOME: /app/cache
    TORCH_HOME: /app/models/torch
    TMPDIR: /app/cache/.tmp           # NEW
    # ... rest unchanged
  tmpfs:
    - /tmp:size=512m                  # unchanged

Pros

No RAM cost. The Surya model stages on the SSD where the final cached copy will live, then shutil.move becomes a same-filesystem rename(2) — atomic and near-free.
Training-ZIP path keeps its fast RAM-backed /tmp (the original intent of the 512 MB sizing).
Self-documenting: separating download scratch from job scratch matches their actual lifecycles.

Cons

Requires the .tmp subdirectory to exist with correct ownership before first use. Two ways to guarantee that:
1. Add a mkdir -p "$TMPDIR" && chown step to entrypoint.sh — but the container runs as ocr and read_only: true, so mkdir is OK on the volume mount but chown would fail. Safer: just mkdir -p "$TMPDIR" (the parent volume is already owned by ocr after today's chown, so the new dir inherits correctly).
2. Pre-create via Dockerfile (RUN mkdir -p /app/cache/.tmp && chown ocr:ocr /app/cache/.tmp) — but /app/cache is volume-shadowed at runtime, so this only works on a fresh volume. Existing volumes need the entrypoint variant.
If the container is docker killed mid-download, the TemporaryDirectory context manager doesn't get to clean up — leaves orphaned files in .tmp. Mitigation: add find "$TMPDIR" -mtime +1 -delete to entrypoint, or accept it (the bytes are small compared to the 1.8 TB available).

Approach B — Enlarge `/tmp` tmpfs to 4 GB

One-line change:

tmpfs:
  - /tmp:size=4g    # was 512m

Pros

Trivial diff. No env var, no entrypoint changes, no Dockerfile changes.
Works on existing volumes without any migration.

Cons

The 4 GB comes out of host RAM whenever staging is in use. The OCR container's mem_limit: 12g is set on the cgroup, but tmpfs accounting depends on host config — in the worst case, a 1.34 GB download sits in host RAM, then shutil.move to /app/cache copies it (cross-filesystem move = read + write + delete) consuming briefly ~2.7 GB total RAM.
/tmp and "model download staging" share a budget. A future model 2× larger requires bumping again.
Less aligned with the intent of read-only-rootfs + tmpfs hardening: tmpfs is meant for ephemeral small writes, not gigabyte staging.

Approach C — Apply both

TMPDIR=/app/cache/.tmp and /tmp:size=1g. Belt-and-suspenders: any future code that ignores $TMPDIR and writes directly to /tmp still has 1 GB of headroom for typical work (training ZIPs, transient PDF buffers).

Pros

Defense in depth — Approach B as a safety net behind Approach A.
1 GB tmpfs is still cheap; 4 GB would be excessive given Approach A removes the primary consumer.

Cons

Two changes instead of one; slightly more compose churn.
1 GB RAM reservation per container, even when idle (tmpfs only counts pages actually touched, but the limit reserves address space).

Acceptance criteria

Guided OCR end-to-end succeeds on staging after a fresh container start (Surya text_recognition model fully downloads to /app/cache/datalab/models/text_recognition/2025_09_23/ and inference runs).
Training endpoint (/train-sender, see main.py) still works — ZIPs of 20-50 page batches succeed.
Container retains read_only: true and cap_drop: [ALL].
Container retains non-root ocr user (uid 1000).
docker-compose.yml in the repo and the deployed compose at /opt/familienarchiv/docker-compose.yml on the server are kept in sync (CI/CD redeploy is the canonical sync path — no manual edits to the server file).
On a fresh ocr_cache volume (no pre-chown), the chosen approach still works (relevant for new environments + the eventual repaving of the staging volume).

Test plan

Local: docker compose up -d ocr-service, then trigger a guided OCR call against a document with no cached Surya model. Confirm download completes (~10 s on home connection) and OCR returns text.
Local: trigger /train-sender with a sample ZIP — confirm training endpoint still functional.
Staging dry run: SSH to raddatz.cloud, docker volume rm archiv-staging_ocr-cache (force fresh-volume path), docker compose ... up -d ocr-service, repeat #1 against staging.
Verify /app/cache/.tmp ownership inside the running container: docker exec ... ls -ldn /app/cache/.tmp should show ocr:ocr (uid 1000).
Stress test (only if Approach B/C chosen): chain three 1.34 GB downloads concurrently to confirm tmpfs limit is not the bottleneck.

Files to touch

docker-compose.yml — ocr-service.environment (Approach A or C) and/or ocr-service.tmpfs (Approach B or C). Update the inline comment on the tmpfs line to reflect the new sizing rationale.
ocr-service/entrypoint.sh — Approach A or C: prepend mkdir -p "${TMPDIR:-/tmp}" so a fresh ocr_cache volume gets the directory on first start. Idempotent; safe on already-populated volumes.
ocr-service/CLAUDE.md and/or README.md — one-line note about the TMPDIR convention (Approach A or C only) so future contributors don't undo it.

Not touched:

Dockerfile — /app/cache/.tmp cannot be baked in because the volume mount shadows it.
docker-compose.prod.yml / staging overlays — env var is already inherited from the base compose.

Out of scope (separate tickets)

Kraken german_kurrent.mlmodel is missing from archiv-staging_ocr-models. Pre-existing gap; Surya works without it, but Kurrent-specific OCR is disabled until the model is seeded. File a separate issue.
Today's manual chown -R 1000:1000 of archiv-staging_ocr-cache and archiv-staging_ocr-models worked but is not reproducible from version control. A separate hardening ticket should make volume ownership automatic — e.g. via a one-shot init container or a volume-bootstrap step in CI/CD.
ADR or commit comment explaining the TMPDIR convention if Approach A/C is chosen — author's call; not required for merge.

1aca4c4a security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile
ab24786d security(ocr): harden compose — fix cache volume path, add read_only + cap_drop
fc8b4b16 security(ocr): redirect XDG cache and Torch home away from read-only HOME
f1e0b92f style(ocr): normalize cap_drop to block notation in docker-compose.yml
Manual fix applied today on raddatz.cloud (2026-05-18): docker stop archiv-staging-ocr-service-1 && docker run --rm -v archiv-staging_ocr-cache:/c alpine chown -R 1000:1000 /c && docker run --rm -v archiv-staging_ocr-models:/m alpine chown -R 1000:1000 /m && docker start archiv-staging-ocr-service-1. Volumes now correctly owned by uid 1000; service healthy; this issue is what surfaces next.

## Context After the OCR security hardening landed (commits `1aca4c4a`, `ab24786d`, `fc8b4b16`, `f1e0b92f`), the OCR container in **staging** (`archiv-staging-ocr-service-1`) cannot complete a guided OCR request. Two distinct failures have surfaced. The first — `PermissionError: '/app/cache/datalab'` — was fixed today by chowning the pre-existing `archiv-staging_ocr-cache` and `archiv-staging_ocr-models` Docker volumes from `0:0` → `1000:1000` (the volumes pre-dated the non-root `ocr` user introduced in `1aca4c4a`). **This issue tracks the second failure**, which only became visible after the chown unblocked the first. ## Symptom First guided OCR call after a clean restart triggers Surya to download `text_recognition/2025_09_23/model.safetensors` (1.34 GB). The download fails at ~510 MB, retried 3× by Surya, each time with: ``` OSError: [Errno 28] No space left on device File "/usr/local/lib/python3.11/site-packages/surya/common/s3.py", line 55, in download_file f.write(chunk) ``` The host (`raddatz.cloud`, `/dev/md2`) has 1.8 TB free. `/app/cache` and `/app/models` are both on that disk. The space exhausted is somewhere else. ## Root cause `surya/common/s3.py:95` — `download_directory` stages every file into a `tempfile.TemporaryDirectory()` and `shutil.move`s to the cache afterwards: ```python with tempfile.TemporaryDirectory() as temp_dir: ... for file in manifest["files"]: local_file = os.path.join(temp_dir, file) futures.append(executor.submit(download_file, remote_file, local_file)) ... # Move all files to new directory for file in os.listdir(temp_dir): shutil.move(os.path.join(temp_dir, file), local_dir) ``` `tempfile.TemporaryDirectory()` honours `$TMPDIR` and otherwise falls back to `/tmp`. The OCR container declares: ```yaml read_only: true tmpfs: - /tmp:size=512m # training endpoints write ZIPs to /tmp; 512 MB covers typical batches (20–50 images) ``` `TMPDIR` is unset → staging path resolves to the 512 MB `/tmp` tmpfs → 1.34 GB safetensors blows the budget at ~510 MB. The comment on that line was written before Surya was introduced; nobody re-evaluated `/tmp` sizing for model downloads. Inside the running container: ``` tmpfs 512M 8.0K 512M 1% /tmp /dev/md2 2.0T 44G 1.8T 3% /app/cache /dev/md2 2.0T 44G 1.8T 3% /app/models ``` ## Approaches under consideration Pick exactly one. Trade-offs are real — please choose deliberately. ### Approach A — Redirect `TMPDIR` to the disk-backed cache volume (recommended) Add a new `TMPDIR=/app/cache/.tmp` env var to the `ocr-service` block, and ensure the directory exists with `ocr:ocr` ownership. The `/tmp` tmpfs stays at 512 MB and continues to absorb training ZIPs (which are small and benefit from being in RAM). ```yaml ocr-service: environment: HF_HOME: /app/cache XDG_CACHE_HOME: /app/cache TORCH_HOME: /app/models/torch TMPDIR: /app/cache/.tmp # NEW # ... rest unchanged tmpfs: - /tmp:size=512m # unchanged ``` **Pros** - No RAM cost. The Surya model stages on the SSD where the final cached copy will live, then `shutil.move` becomes a same-filesystem `rename(2)` — atomic and near-free. - Training-ZIP path keeps its fast RAM-backed `/tmp` (the original intent of the 512 MB sizing). - Self-documenting: separating *download scratch* from *job scratch* matches their actual lifecycles. **Cons** - Requires the `.tmp` subdirectory to exist with correct ownership before first use. Two ways to guarantee that: 1. Add a `mkdir -p "$TMPDIR" && chown` step to `entrypoint.sh` — but the container runs as `ocr` and `read_only: true`, so `mkdir` is OK on the volume mount but `chown` would fail. Safer: just `mkdir -p "$TMPDIR"` (the parent volume is already owned by `ocr` after today's chown, so the new dir inherits correctly). 2. Pre-create via `Dockerfile` (`RUN mkdir -p /app/cache/.tmp && chown ocr:ocr /app/cache/.tmp`) — but `/app/cache` is volume-shadowed at runtime, so this only works on a *fresh* volume. Existing volumes need the entrypoint variant. - If the container is `docker kill`ed mid-download, the `TemporaryDirectory` context manager doesn't get to clean up — leaves orphaned files in `.tmp`. Mitigation: add `find "$TMPDIR" -mtime +1 -delete` to entrypoint, or accept it (the bytes are small compared to the 1.8 TB available). ### Approach B — Enlarge `/tmp` tmpfs to 4 GB One-line change: ```yaml tmpfs: - /tmp:size=4g # was 512m ``` **Pros** - Trivial diff. No env var, no entrypoint changes, no Dockerfile changes. - Works on existing volumes without any migration. **Cons** - The 4 GB comes out of *host RAM* whenever staging is in use. The OCR container's `mem_limit: 12g` is set on the cgroup, but tmpfs accounting depends on host config — in the worst case, a 1.34 GB download sits in host RAM, then `shutil.move` to `/app/cache` copies it (cross-filesystem move = read + write + delete) consuming briefly ~2.7 GB total RAM. - `/tmp` and "model download staging" share a budget. A future model 2× larger requires bumping again. - Less aligned with the intent of read-only-rootfs + tmpfs hardening: tmpfs is meant for ephemeral *small* writes, not gigabyte staging. ### Approach C — Apply both `TMPDIR=/app/cache/.tmp` **and** `/tmp:size=1g`. Belt-and-suspenders: any future code that ignores `$TMPDIR` and writes directly to `/tmp` still has 1 GB of headroom for typical work (training ZIPs, transient PDF buffers). **Pros** - Defense in depth — Approach B as a safety net behind Approach A. - 1 GB tmpfs is still cheap; 4 GB would be excessive given Approach A removes the primary consumer. **Cons** - Two changes instead of one; slightly more compose churn. - 1 GB RAM reservation per container, even when idle (tmpfs only counts pages actually touched, but the limit reserves address space). ## Acceptance criteria - [ ] Guided OCR end-to-end succeeds on staging after a fresh container start (Surya text_recognition model fully downloads to `/app/cache/datalab/models/text_recognition/2025_09_23/` and inference runs). - [ ] Training endpoint (`/train-sender`, see `main.py`) still works — ZIPs of 20-50 page batches succeed. - [ ] Container retains `read_only: true` and `cap_drop: [ALL]`. - [ ] Container retains non-root `ocr` user (uid 1000). - [ ] `docker-compose.yml` in the repo and the deployed compose at `/opt/familienarchiv/docker-compose.yml` on the server are kept in sync (CI/CD redeploy is the canonical sync path — no manual edits to the server file). - [ ] On a *fresh* `ocr_cache` volume (no pre-chown), the chosen approach still works (relevant for new environments + the eventual repaving of the staging volume). ## Test plan 1. Local: `docker compose up -d ocr-service`, then trigger a guided OCR call against a document with no cached Surya model. Confirm download completes (~10 s on home connection) and OCR returns text. 2. Local: trigger `/train-sender` with a sample ZIP — confirm training endpoint still functional. 3. Staging dry run: SSH to `raddatz.cloud`, `docker volume rm archiv-staging_ocr-cache` (force fresh-volume path), `docker compose ... up -d ocr-service`, repeat #1 against staging. 4. Verify `/app/cache/.tmp` ownership inside the running container: `docker exec ... ls -ldn /app/cache/.tmp` should show `ocr:ocr` (uid 1000). 5. Stress test (only if Approach B/C chosen): chain three 1.34 GB downloads concurrently to confirm tmpfs limit is not the bottleneck. ## Files to touch - `docker-compose.yml` — `ocr-service.environment` (Approach A or C) and/or `ocr-service.tmpfs` (Approach B or C). Update the inline comment on the tmpfs line to reflect the new sizing rationale. - `ocr-service/entrypoint.sh` — Approach A or C: prepend `mkdir -p "${TMPDIR:-/tmp}"` so a fresh `ocr_cache` volume gets the directory on first start. Idempotent; safe on already-populated volumes. - `ocr-service/CLAUDE.md` and/or `README.md` — one-line note about the `TMPDIR` convention (Approach A or C only) so future contributors don't undo it. **Not** touched: - `Dockerfile` — `/app/cache/.tmp` cannot be baked in because the volume mount shadows it. - `docker-compose.prod.yml` / staging overlays — env var is already inherited from the base compose. ## Out of scope (separate tickets) - Kraken `german_kurrent.mlmodel` is missing from `archiv-staging_ocr-models`. Pre-existing gap; Surya works without it, but Kurrent-specific OCR is disabled until the model is seeded. File a separate issue. - Today's manual `chown -R 1000:1000` of `archiv-staging_ocr-cache` and `archiv-staging_ocr-models` worked but is not reproducible from version control. A separate hardening ticket should make volume ownership automatic — e.g. via a one-shot init container or a volume-bootstrap step in CI/CD. - ADR or commit comment explaining the `TMPDIR` convention if Approach A/C is chosen — author's call; not required for merge. ## Related work / history - `1aca4c4a security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile` - `ab24786d security(ocr): harden compose — fix cache volume path, add read_only + cap_drop` - `fc8b4b16 security(ocr): redirect XDG cache and Torch home away from read-only HOME` - `f1e0b92f style(ocr): normalize cap_drop to block notation in docker-compose.yml` - Manual fix applied today on `raddatz.cloud` (2026-05-18): `docker stop archiv-staging-ocr-service-1 && docker run --rm -v archiv-staging_ocr-cache:/c alpine chown -R 1000:1000 /c && docker run --rm -v archiv-staging_ocr-models:/m alpine chown -R 1000:1000 /m && docker start archiv-staging-ocr-service-1`. Volumes now correctly owned by uid 1000; service healthy; this issue is what surfaces next.

marcel added the P0-critical bug devops labels 2026-05-18 10:39:03 +02:00

marcel commented

2026-05-18 10:41:05 +02:00

🏛️ Markus Keller — Senior Application Architect

Observations

The three approaches are well-framed, but the discussion conflates two architectural concerns that should be separated: where Surya's download staging happens vs what /tmp is for (training-ZIP unpacking, transient PDF buffers). Approach A correctly separates them; B collapses them back into one budget.
The Surya 1.34 GB model is not an outlier — text_recognition is one of several Surya models, and surya_engine is the project's default OCR engine (see engines/__init__.py import in main.py:27). Future Surya releases will be in this size range or larger.
The "manual chown" history described in the issue is exactly the kind of out-of-band state mutation that ADR-001 (single-node OCR) should have warned about. Volume ownership belongs to infrastructure-as-code, not to ad-hoc docker run --rm alpine chown.
docker-compose.yml and docker-compose.prod.yml both carry the identical tmpfs: /tmp:size=512m comment about "training endpoints write ZIPs to /tmp" — duplicated config drift waiting to happen.

Recommendations

Pick Approach A. It is the only one of the three that respects the boundary between "model artifacts on persistent SSD" and "ephemeral RAM-backed scratch." Approach B violates the rule "push state to where it belongs" — staging GB-scale model downloads through a RAM tmpfs is using the wrong storage tier. Approach C is hedging: take the cleaner design or take the simpler one, not both.
Write ADR-008 documenting the TMPDIR=/app/cache/.tmp convention. The decision has the lifetime signature of an ADR: it constrains future contributors (they cannot move /app/cache to bind-mount semantics without breaking this), it cites a concrete failure mode (ENOSPC), and it survives across deployments. Per CLAUDE.md, the author note in the issue ("ADR or commit comment ... author's call") understates this — TMPDIR pointing inside a volume mount is a non-obvious decision that will be reverted by someone in 2 years if there is no ADR.
Move the out-of-scope "automate volume ownership" item up the priority list and link it from this issue. The chown migration story is fragile; if the eventual fix is an init-container, that container should also mkdir -p /app/cache/.tmp (Approach A's mkdir step becomes redundant). Solving them together avoids a second entrypoint.sh modification later.
Inline-comment the TMPDIR env var with the threat model, not the mechanism. Something like # Stage GB-scale model downloads on the disk-backed cache volume, not the RAM tmpfs. See ADR-008. is the kind of comment Nora's persona principle ("explain why this is safe") calls for.

Open Decisions

(none — Approach A is clearly the right architectural choice; the ADR + the volume-bootstrap follow-up issue are blocking-quality recommendations, not tradeoffs)

## 🏛️ Markus Keller — Senior Application Architect ### Observations - The three approaches are well-framed, but the discussion conflates two architectural concerns that should be separated: **where Surya's *download* staging happens** vs **what `/tmp` is for** (training-ZIP unpacking, transient PDF buffers). Approach A correctly separates them; B collapses them back into one budget. - The Surya 1.34 GB model is *not* an outlier — `text_recognition` is one of several Surya models, and `surya_engine` is the project's default OCR engine (see `engines/__init__.py` import in `main.py:27`). Future Surya releases will be in this size range or larger. - The "manual chown" history described in the issue is exactly the kind of out-of-band state mutation that ADR-001 (single-node OCR) should have warned about. Volume ownership belongs to infrastructure-as-code, not to ad-hoc `docker run --rm alpine chown`. - `docker-compose.yml` and `docker-compose.prod.yml` both carry the identical `tmpfs: /tmp:size=512m` comment about "training endpoints write ZIPs to /tmp" — duplicated config drift waiting to happen. ### Recommendations - **Pick Approach A.** It is the only one of the three that respects the boundary between "model artifacts on persistent SSD" and "ephemeral RAM-backed scratch." Approach B violates the rule "push state to where it belongs" — staging GB-scale model downloads through a RAM tmpfs is using the wrong storage tier. Approach C is hedging: take the cleaner design or take the simpler one, not both. - **Write ADR-008** documenting the `TMPDIR=/app/cache/.tmp` convention. The decision has the lifetime signature of an ADR: it constrains future contributors (they cannot move `/app/cache` to bind-mount semantics without breaking this), it cites a concrete failure mode (ENOSPC), and it survives across deployments. Per CLAUDE.md, the author note in the issue ("ADR or commit comment ... author's call") understates this — `TMPDIR` pointing inside a volume mount is a non-obvious decision that *will* be reverted by someone in 2 years if there is no ADR. - **Move the out-of-scope "automate volume ownership" item up the priority list and link it from this issue.** The chown migration story is fragile; if the eventual fix is an init-container, that container should also `mkdir -p /app/cache/.tmp` (Approach A's mkdir step becomes redundant). Solving them together avoids a second `entrypoint.sh` modification later. - **Inline-comment the `TMPDIR` env var with the threat model**, not the mechanism. Something like `# Stage GB-scale model downloads on the disk-backed cache volume, not the RAM tmpfs. See ADR-008.` is the kind of comment Nora's persona principle ("explain *why* this is safe") calls for. ### Open Decisions - _(none — Approach A is clearly the right architectural choice; the ADR + the volume-bootstrap follow-up issue are blocking-quality recommendations, not tradeoffs)_

marcel commented

2026-05-18 10:41:26 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

Three tempfile.TemporaryDirectory() call sites in ocr-service/main.py (lines 383, 480, 557 — /train, /train-sender, /segtrain). All three currently rely on /tmp. Approach A's TMPDIR redirect fixes all three at once without code changes.
The actual ENOSPC is inside the Surya library (surya/common/s3.py:55), not our code — we cannot wrap or patch the call site. TMPDIR is the only knob the host can turn without forking Surya.
entrypoint.sh is currently 9 lines and does exactly one thing (validate blla model then exec uvicorn). Adding an mkdir -p is a single line that fits its style.
The Dockerfile already declares HF_HOME=/app/cache etc. as ENV. Putting TMPDIR only in compose means container users who run the image with no compose layer (e.g. docker run for diagnostics) lose the safety. But putting ENV TMPDIR=/app/cache/.tmp in the Dockerfile risks pointing at a path that doesn't exist on a fresh image (no compose-mounted volume). The entrypoint mkdir -p "$TMPDIR" resolves both — the Dockerfile sets the default, the entrypoint guarantees the directory.

Recommendations

Approach A. Set TMPDIR=/app/cache/.tmp in docker-compose.yml AND docker-compose.prod.yml (both files, atomic commit — they diverged silently once already with the 512 MB tmpfs comment, that's the smell to avoid). Also add ENV TMPDIR=/app/cache/.tmp to the Dockerfile as a default so docker run archive-ocr doesn't blow up either.
Entrypoint change is exactly:
```
mkdir -p "${TMPDIR:-/tmp}"
```
Place it before python3 /app/ensure_blla_model.py so the BLLA fallback download also benefits. The :-/tmp fallback keeps the script safe if TMPDIR is ever unset.
TDD path for this fix: write a pytest in ocr-service/ that calls tempfile.gettempdir() in a subprocess started with TMPDIR=/some/path and asserts it resolves to that path. Then write an integration-style test that does tempfile.TemporaryDirectory() and verifies the path begins with os.environ["TMPDIR"]. Both are <10-line tests. The entrypoint.sh mkdir line is harder to TDD; assert it via a shellcheck + a smoke step in CI that runs docker compose build && docker compose run --rm ocr-service ls -ld /app/cache/.tmp.
Skip the orphaned-file cleanup mitigation the issue mentions (find $TMPDIR -mtime +1 -delete). It is overengineering for a 1.8 TB disk with sub-GB orphan candidates. YAGNI — revisit if monitoring ever flags it.
Doc updates required by CLAUDE.md after this PR:
- ocr-service/README.md — add TMPDIR to the env var table (it has rows for HF_HOME, XDG_CACHE_HOME, TORCH_HOME already; one more matches the pattern).
- ocr-service/CLAUDE.md — single LLM-reminder bullet ("TMPDIR points into the persistent cache volume; do not redirect to RAM tmpfs").
- New ADR (per Markus's recommendation). Skip the c4 diagrams — no service topology changes.

Open Decisions

(none — Approach A with the exact diff above. The TDD plan covers the regression we care about)

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Observations - Three `tempfile.TemporaryDirectory()` call sites in `ocr-service/main.py` (lines 383, 480, 557 — `/train`, `/train-sender`, `/segtrain`). All three currently rely on `/tmp`. Approach A's `TMPDIR` redirect fixes all three at once without code changes. - The actual ENOSPC is inside the Surya library (`surya/common/s3.py:55`), not our code — we cannot wrap or patch the call site. `TMPDIR` is the only knob the host can turn without forking Surya. - `entrypoint.sh` is currently 9 lines and does exactly one thing (validate blla model then exec uvicorn). Adding an `mkdir -p` is a single line that fits its style. - The Dockerfile already declares `HF_HOME=/app/cache` etc. as `ENV`. Putting `TMPDIR` *only* in compose means container users who run the image with no compose layer (e.g. `docker run` for diagnostics) lose the safety. But putting `ENV TMPDIR=/app/cache/.tmp` in the Dockerfile risks pointing at a path that doesn't exist on a fresh image (no compose-mounted volume). The entrypoint `mkdir -p "$TMPDIR"` resolves both — the Dockerfile sets the default, the entrypoint guarantees the directory. ### Recommendations - **Approach A.** Set `TMPDIR=/app/cache/.tmp` in `docker-compose.yml` AND `docker-compose.prod.yml` (both files, atomic commit — they diverged silently once already with the 512 MB tmpfs comment, that's the smell to avoid). Also add `ENV TMPDIR=/app/cache/.tmp` to the Dockerfile as a default so `docker run archive-ocr` doesn't blow up either. - **Entrypoint change** is exactly: ```bash mkdir -p "${TMPDIR:-/tmp}" ``` Place it *before* `python3 /app/ensure_blla_model.py` so the BLLA fallback download also benefits. The `:-/tmp` fallback keeps the script safe if `TMPDIR` is ever unset. - **TDD path for this fix:** write a pytest in `ocr-service/` that calls `tempfile.gettempdir()` in a subprocess started with `TMPDIR=/some/path` and asserts it resolves to that path. Then write an integration-style test that does `tempfile.TemporaryDirectory()` and verifies the path begins with `os.environ["TMPDIR"]`. Both are <10-line tests. The `entrypoint.sh` `mkdir` line is harder to TDD; assert it via a shellcheck + a smoke step in CI that runs `docker compose build && docker compose run --rm ocr-service ls -ld /app/cache/.tmp`. - **Skip the orphaned-file cleanup mitigation** the issue mentions (`find $TMPDIR -mtime +1 -delete`). It is overengineering for a 1.8 TB disk with sub-GB orphan candidates. YAGNI — revisit if monitoring ever flags it. - **Doc updates** required by CLAUDE.md after this PR: - `ocr-service/README.md` — add `TMPDIR` to the env var table (it has rows for `HF_HOME`, `XDG_CACHE_HOME`, `TORCH_HOME` already; one more matches the pattern). - `ocr-service/CLAUDE.md` — single LLM-reminder bullet ("TMPDIR points into the persistent cache volume; do not redirect to RAM tmpfs"). - New ADR (per Markus's recommendation). Skip the c4 diagrams — no service topology changes. ### Open Decisions - _(none — Approach A with the exact diff above. The TDD plan covers the regression we care about)_

marcel commented

2026-05-18 10:41:52 +02:00

🛠️ Tobias Wendt — DevOps & Platform Engineer

Observations

The issue is well-diagnosed; the root-cause section is the kind of post-mortem write-up I want to see. Three real-world things stand out:
1. Both compose files carry the stale "20-50 images" comment. docker-compose.yml:114 and docker-compose.prod.yml:169 are byte-identical on the tmpfs line. Whichever approach lands, the comment update has to be made in both files in the same commit. (We learned this lesson with #526 already — compose config rendered output is the canonical check.)
2. Approach B's 4 GB tmpfs competes with mem_limit: 12g on ocr-service. tmpfs allocations do count against cgroup memory in modern Docker (kernel ≥ 4.x). A 4 GB tmpfs + Surya's ~5 GB resident model + 12g cgroup limit is fine on CX42 (16 GB host RAM), but on CX32 with OCR_MEM_LIMIT=6g (documented in docs/DEPLOYMENT.md:143) it will OOMKill on cold start. That alone disqualifies B for the staging-on-CX42 / future-prod-on-CX32 path.
3. The chown war story is a CI gap, not a "manual hardening" story. Staging deploys via nightly.yml should not require a human SSH session with docker run --rm alpine chown. That's the issue we should be tracking even more urgently than this one.
The cache volume reuses across container recreates is exactly what we want — Approach A keeps that property intact (the .tmp subdirectory sits inside the same volume, gets created on first start, survives restarts).

Recommendations

Approach A for the reasons in the issue body, plus: it's cheaper on RAM, it survives CX32 sizing, and shutil.move between same-filesystem paths becomes rename(2) — atomic and ~free. Approach B trades correctness for diff size; that's a poor trade in infra code.

Concrete compose diff I'd ship (both files, same commit):

environment:
  ...
  TMPDIR: /app/cache/.tmp           # NEW: stage GB-scale model downloads on SSD, not the 512 MB tmpfs
tmpfs:
  - /tmp:size=512m                  # training-ZIP unzip + transient PDF buffers only (small, RAM-friendly)

Update both inline comments. Audit compose config output side-by-side before merging to catch drift.

Open a follow-up issue immediately (don't leave it in "Out of scope"): devops(ocr): automate ocr_cache + ocr_models volume ownership on first start. Recommendation is a one-shot init container in docker-compose.prod.yml that runs before ocr-service:
```
ocr-volume-init:
  image: alpine:3
  command: ["sh", "-c", "chown -R 1000:1000 /app/cache /app/models && mkdir -p /app/cache/.tmp && chown 1000:1000 /app/cache/.tmp"]
  volumes:
    - ocr-cache:/app/cache
    - ocr-models:/app/models
  restart: "no"
```
Pair with depends_on: ocr-volume-init: condition: service_completed_successfully on ocr-service (same pattern as create-buckets in docker-compose.prod.yml:191). This makes the manual chown history non-repeatable — exactly what infra-as-code is for. Once that init container exists, the entrypoint mkdir -p "$TMPDIR" becomes redundant; consider sequencing the two PRs so we only touch entrypoint.sh once.
CI smoke for this fix: the nightly.yml already does a docker compose config regression check for the import mount. Add an analogous assertion: grep -q 'TMPDIR: /app/cache/.tmp' /tmp/compose-rendered.yml so the env var cannot be accidentally dropped by a future "cleanup" PR. Costs 2 lines, prevents a 1 AM redeploy.
Renovate / image pinning is fine here — no image bumps involved. But while we're in docker-compose.prod.yml, note that mailpit:v1.29.7 and minio:RELEASE.2025-02-28T09-55-16Z are still hand-pinned. Separate ticket.

Open Decisions

(none — Approach A is the clear choice given CX32 sizing constraints and the cgroup-memory accounting of tmpfs. The volume-bootstrap follow-up is a recommendation, not an open question)

## 🛠️ Tobias Wendt — DevOps & Platform Engineer ### Observations - The issue is well-diagnosed; the root-cause section is the kind of post-mortem write-up I want to see. Three real-world things stand out: 1. **Both compose files carry the stale "20-50 images" comment.** `docker-compose.yml:114` and `docker-compose.prod.yml:169` are byte-identical on the tmpfs line. Whichever approach lands, the comment update has to be made in both files in the same commit. (We learned this lesson with #526 already — `compose config` rendered output is the canonical check.) 2. **Approach B's 4 GB tmpfs** competes with `mem_limit: 12g` on `ocr-service`. tmpfs allocations *do* count against cgroup memory in modern Docker (kernel ≥ 4.x). A 4 GB tmpfs + Surya's ~5 GB resident model + 12g cgroup limit is fine on CX42 (16 GB host RAM), but on CX32 with `OCR_MEM_LIMIT=6g` (documented in `docs/DEPLOYMENT.md:143`) it will OOMKill on cold start. That alone disqualifies B for the staging-on-CX42 / future-prod-on-CX32 path. 3. **The chown war story is a CI gap, not a "manual hardening" story.** Staging deploys via `nightly.yml` should not require a human SSH session with `docker run --rm alpine chown`. That's the issue we should be tracking even more urgently than this one. - The cache volume reuses across container recreates is exactly what we want — Approach A keeps that property intact (the `.tmp` subdirectory sits inside the same volume, gets created on first start, survives restarts). ### Recommendations - **Approach A** for the reasons in the issue body, plus: it's cheaper on RAM, it survives CX32 sizing, and `shutil.move` between same-filesystem paths becomes `rename(2)` — atomic and ~free. Approach B trades correctness for diff size; that's a poor trade in infra code. - **Concrete compose diff** I'd ship (both files, same commit): ```yaml environment: ... TMPDIR: /app/cache/.tmp # NEW: stage GB-scale model downloads on SSD, not the 512 MB tmpfs tmpfs: - /tmp:size=512m # training-ZIP unzip + transient PDF buffers only (small, RAM-friendly) ``` Update both inline comments. Audit `compose config` output side-by-side before merging to catch drift. - **Open a follow-up issue immediately** (don't leave it in "Out of scope"): `devops(ocr): automate ocr_cache + ocr_models volume ownership on first start`. Recommendation is a one-shot init container in `docker-compose.prod.yml` that runs before `ocr-service`: ```yaml ocr-volume-init: image: alpine:3 command: ["sh", "-c", "chown -R 1000:1000 /app/cache /app/models && mkdir -p /app/cache/.tmp && chown 1000:1000 /app/cache/.tmp"] volumes: - ocr-cache:/app/cache - ocr-models:/app/models restart: "no" ``` Pair with `depends_on: ocr-volume-init: condition: service_completed_successfully` on `ocr-service` (same pattern as `create-buckets` in `docker-compose.prod.yml:191`). This makes the manual chown history non-repeatable — exactly what infra-as-code is for. Once that init container exists, the entrypoint `mkdir -p "$TMPDIR"` becomes redundant; consider sequencing the two PRs so we only touch `entrypoint.sh` once. - **CI smoke for this fix:** the `nightly.yml` already does a `docker compose config` regression check for the import mount. Add an analogous assertion: `grep -q 'TMPDIR: /app/cache/.tmp' /tmp/compose-rendered.yml` so the env var cannot be accidentally dropped by a future "cleanup" PR. Costs 2 lines, prevents a 1 AM redeploy. - **Renovate / image pinning** is fine here — no image bumps involved. But while we're in `docker-compose.prod.yml`, note that `mailpit:v1.29.7` and `minio:RELEASE.2025-02-28T09-55-16Z` are still hand-pinned. Separate ticket. ### Open Decisions - _(none — Approach A is the clear choice given CX32 sizing constraints and the cgroup-memory accounting of tmpfs. The volume-bootstrap follow-up is a recommendation, not an open question)_

marcel commented

2026-05-18 10:42:20 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Observations

The recent hardening (1aca4c4a, ab24786d, fc8b4b16, f1e0b92f) is correctly enumerated in the AC: non-root user, read_only: true, cap_drop: [ALL]. None of the three approaches weakens those. Good.
One subtlety: by redirecting TMPDIR to /app/cache/.tmp (Approach A), user-uploaded ZIP extraction in /train, /train-sender, /segtrain (main.py:383, 480, 557) now writes to the persistent cache volume instead of the RAM tmpfs. That has two security implications worth naming:
1. Cross-job leakage window widens. Previously, training scratch lived in tmpfs and disappeared on container restart. Now, if a training run is docker killed mid-extract, ZIP entries persist on the SSD across container lifecycles. The issue body mentions this in passing but downplays it as a disk-space issue. From a security standpoint, the persistence itself is the concern — partial ground-truth data from training-run N could still be on disk during training-run N+1. Mitigation: the existing _validate_zip_entry() ZIP Slip checks (Felix's persona file documents them) still apply, so an attacker can't write outside .tmp. But within .tmp, orphans accumulate.
2. The 512 MB tmpfs cap was an implicit DoS guardrail. Approach B's 4 GB raises that DoS cap by 8x. The TRAINING_TOKEN check (main.py — _check_training_token) gates /train* endpoints, so this is post-auth DoS, not unauthenticated. Lower severity, but worth naming. Approach A preserves the original 512 MB cap on the /tmp-using paths.
Approach A's mkdir -p "$TMPDIR" in entrypoint.sh runs as the ocr user (uid 1000), inside a read_only: true rootfs, with the /app/cache volume mounted RW. That's the only place this mkdir can succeed. It cannot escalate.
read_only: true is preserved in all three approaches — verified against docker-compose.yml:112 and docker-compose.prod.yml:167.

Recommendations

Approach A, for security reasons in addition to the operational ones: it keeps the 512 MB DoS cap on attacker-influenceable paths (/tmp is what tempfile falls back to only if TMPDIR is unset, which it now never is). Approach B raises the attacker's playing field. Approach C does both for no security gain over A.
Add a cleanup-on-startup line to entrypoint.sh to address the cross-job leakage concern from Observation #1:
```
mkdir -p "${TMPDIR:-/tmp}"
# Clear orphaned scratch from previous container crashes (training-ZIP fragments, partial Surya downloads)
find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true
```
This is cheap (runs once at container start, on a directory expected to be small or empty), bounded (-mtime +1 prevents nuking an in-progress download started seconds before a restart), and security-positive (no stale ground-truth XML floats across runs). Counter-argues Felix's "skip cleanup as YAGNI" — disagreed: this is a multi-tenant-ish endpoint (/train accepts arbitrary uploads) and the cleanup costs nothing.
Permanent regression test for the security boundary: add a pytest in ocr-service/test_training_auth.py (it already exists) that asserts the ZIP Slip check still fires under the new TMPDIR. The threat model didn't change, but the working directory did — and _validate_zip_entry() resolves against tmp_dir which is now /app/cache/.tmp/tmpXXX/.... Confirm os.path.realpath() still produces the correct anchoring.

Threat-model comment on the TMPDIR env var (CLAUDE.md says comments explain the threat model, not the code):

# Route GB-scale Surya model downloads to persistent SSD, not the 512 MB RAM tmpfs.
# /tmp keeps its small DoS cap; attacker-uploadable training ZIPs still unpack
# under this directory but ZIP Slip protection (_validate_zip_entry) is unchanged.
TMPDIR: /app/cache/.tmp

Open Decisions

(none — Approach A + cleanup line is the security-correct path)

## 🔐 Nora "NullX" Steiner — Application Security Engineer ### Observations - The recent hardening (`1aca4c4a`, `ab24786d`, `fc8b4b16`, `f1e0b92f`) is correctly enumerated in the AC: non-root user, `read_only: true`, `cap_drop: [ALL]`. None of the three approaches weakens those. Good. - One subtlety: by redirecting `TMPDIR` to `/app/cache/.tmp` (Approach A), user-uploaded ZIP extraction in `/train`, `/train-sender`, `/segtrain` (main.py:383, 480, 557) now writes to the persistent cache volume instead of the RAM tmpfs. That has two security implications worth naming: 1. **Cross-job leakage window widens.** Previously, training scratch lived in tmpfs and disappeared on container restart. Now, if a training run is `docker kill`ed mid-extract, ZIP entries persist on the SSD across container lifecycles. The issue body mentions this in passing but downplays it as a disk-space issue. From a security standpoint, *the persistence itself* is the concern — partial ground-truth data from training-run N could still be on disk during training-run N+1. Mitigation: the existing `_validate_zip_entry()` ZIP Slip checks (Felix's persona file documents them) still apply, so an attacker can't *write outside* `.tmp`. But within `.tmp`, orphans accumulate. 2. **The 512 MB tmpfs cap was an implicit DoS guardrail.** Approach B's 4 GB raises that DoS cap by 8x. The `TRAINING_TOKEN` check (main.py — `_check_training_token`) gates `/train*` endpoints, so this is post-auth DoS, not unauthenticated. Lower severity, but worth naming. Approach A preserves the original 512 MB cap on the `/tmp`-using paths. - Approach A's `mkdir -p "$TMPDIR"` in `entrypoint.sh` runs as the `ocr` user (uid 1000), inside a `read_only: true` rootfs, with the `/app/cache` volume mounted RW. That's the *only* place this `mkdir` can succeed. It cannot escalate. - `read_only: true` is preserved in all three approaches — verified against `docker-compose.yml:112` and `docker-compose.prod.yml:167`. ### Recommendations - **Approach A**, for security reasons in addition to the operational ones: it keeps the 512 MB DoS cap on attacker-influenceable paths (`/tmp` is what `tempfile` falls back to *only* if `TMPDIR` is unset, which it now never is). Approach B raises the attacker's playing field. Approach C does both for no security gain over A. - **Add a cleanup-on-startup line to `entrypoint.sh`** to address the cross-job leakage concern from Observation #1: ```bash mkdir -p "${TMPDIR:-/tmp}" # Clear orphaned scratch from previous container crashes (training-ZIP fragments, partial Surya downloads) find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true ``` This is cheap (runs once at container start, on a directory expected to be small or empty), bounded (`-mtime +1` prevents nuking an in-progress download started seconds before a restart), and security-positive (no stale ground-truth XML floats across runs). Counter-argues Felix's "skip cleanup as YAGNI" — disagreed: this is a multi-tenant-ish endpoint (`/train` accepts arbitrary uploads) and the cleanup costs nothing. - **Permanent regression test for the security boundary:** add a pytest in `ocr-service/test_training_auth.py` (it already exists) that asserts the ZIP Slip check still fires under the new `TMPDIR`. The threat model didn't change, but the working directory did — and `_validate_zip_entry()` resolves against `tmp_dir` which is now `/app/cache/.tmp/tmpXXX/...`. Confirm `os.path.realpath()` still produces the correct anchoring. - **Threat-model comment** on the `TMPDIR` env var (CLAUDE.md says comments explain the threat model, not the code): ```yaml # Route GB-scale Surya model downloads to persistent SSD, not the 512 MB RAM tmpfs. # /tmp keeps its small DoS cap; attacker-uploadable training ZIPs still unpack # under this directory but ZIP Slip protection (_validate_zip_entry) is unchanged. TMPDIR: /app/cache/.tmp ``` ### Open Decisions - _(none — Approach A + cleanup line is the security-correct path)_

marcel commented

2026-05-18 10:42:39 +02:00

🧪 Sara Holt — Senior QA Engineer

Observations

The "Acceptance criteria" and "Test plan" sections are unusually thorough for a bug report — both happy path (guided OCR succeeds) and regression path (/train-sender still works) are named, and the AC explicitly call out read_only: true / cap_drop / non-root preservation. That's the right shape.
However, every step in the Test plan is a manual smoke test. There is no permanent automated regression. If this ever happens again — say a future model bump in Surya pushes past 1.8 TB or a developer "cleans up" the TMPDIR env var — we will rediscover it in production again.
ocr-service/test_main.py, test_training_auth.py, test_stream.py, test_ensure_blla_model.py all exist — there is an established Python test surface to extend.
Critical gap in the AC: "On a fresh ocr_cache volume, the chosen approach still works" is named, but the verification is left to manual docker volume rm + restart. This is exactly the case that broke staging (pre-existing volumes with 0:0 ownership). It deserves an automated check.

Recommendations

Add three pytest cases in ocr-service/test_main.py (or a new test_tmpdir.py) — red/green TDD style:
1. def test_tempfile_uses_tmpdir_when_set(monkeypatch, tmp_path): — monkeypatch TMPDIR to tmp_path, call tempfile.TemporaryDirectory(), assert the returned path begins with tmp_path. Proves Python honours the env var.
2. def test_entrypoint_creates_tmpdir(): — shell out to bash entrypoint.sh (mock python3 and uvicorn), set TMPDIR=/tmp/test-tmpdir-xyz, assert directory exists afterward. Proves the mkdir -p works.
3. def test_tmpdir_is_inside_persistent_volume_path(): — assert that the configured TMPDIR in environment lives under /app/cache (matches the compose contract). Catches drift if someone later writes TMPDIR=/tmp/something.
CI smoke step (DevOps-owned): pair with Tobias's compose config | grep TMPDIR assertion. That covers the YAML side; my tests cover the runtime side.
Add an explicit AC line:
- A pytest in ocr-service/ proves that tempfile.TemporaryDirectory() resolves under the configured TMPDIR and that entrypoint.sh creates the directory if absent. The test runs in CI on every push.
Drop the "Stress test (only if Approach B/C chosen): chain three 1.34 GB downloads" step. Approach A makes it unnecessary, and even under B it's not a useful test — it's a hard-to-reproduce manual exercise. If we worry about concurrent downloads, write a deterministic test that asserts tempfile chooses TMPDIR with N parallel tempfile.TemporaryDirectory() calls in threads.
Strengthen step 4 of the Test plan: instead of ls -ldn, write a one-line bash assertion that exits non-zero on the wrong owner, so it works as a CI step too:
```
docker exec archive-ocr test "$(stat -c %u /app/cache/.tmp)" = "1000" || exit 1
```
Test pyramid placement: these are unit tests (<10s). They belong at the unit layer, not in Playwright. No E2E impact.

Open Decisions

(none — the AC need one more line for the automated regression test; the rest of the plan is sound)

## 🧪 Sara Holt — Senior QA Engineer ### Observations - The "Acceptance criteria" and "Test plan" sections are unusually thorough for a bug report — both happy path (guided OCR succeeds) and regression path (`/train-sender` still works) are named, and the AC explicitly call out `read_only: true` / `cap_drop` / non-root preservation. That's the right shape. - However, every step in the Test plan is a **manual smoke test**. There is no permanent automated regression. If this ever happens again — say a future model bump in Surya pushes past 1.8 TB or a developer "cleans up" the `TMPDIR` env var — we will rediscover it in production again. - `ocr-service/test_main.py`, `test_training_auth.py`, `test_stream.py`, `test_ensure_blla_model.py` all exist — there is an established Python test surface to extend. - **Critical gap in the AC:** "On a fresh `ocr_cache` volume, the chosen approach still works" is named, but the verification is left to manual `docker volume rm` + restart. This is exactly the case that broke staging (pre-existing volumes with `0:0` ownership). It deserves an automated check. ### Recommendations - **Add three pytest cases in `ocr-service/test_main.py` (or a new `test_tmpdir.py`) — red/green TDD style:** 1. `def test_tempfile_uses_tmpdir_when_set(monkeypatch, tmp_path):` — monkeypatch `TMPDIR` to `tmp_path`, call `tempfile.TemporaryDirectory()`, assert the returned path begins with `tmp_path`. Proves Python honours the env var. 2. `def test_entrypoint_creates_tmpdir():` — shell out to `bash entrypoint.sh` (mock `python3` and `uvicorn`), set `TMPDIR=/tmp/test-tmpdir-xyz`, assert directory exists afterward. Proves the `mkdir -p` works. 3. `def test_tmpdir_is_inside_persistent_volume_path():` — assert that the configured `TMPDIR` in environment lives under `/app/cache` (matches the compose contract). Catches drift if someone later writes `TMPDIR=/tmp/something`. - **CI smoke step (DevOps-owned)**: pair with Tobias's `compose config | grep TMPDIR` assertion. That covers the YAML side; my tests cover the runtime side. - **Add an explicit AC line:** > - [ ] A `pytest` in `ocr-service/` proves that `tempfile.TemporaryDirectory()` resolves under the configured `TMPDIR` and that `entrypoint.sh` creates the directory if absent. The test runs in CI on every push. - **Drop the "Stress test (only if Approach B/C chosen): chain three 1.34 GB downloads" step.** Approach A makes it unnecessary, and even under B it's not a useful test — it's a hard-to-reproduce manual exercise. If we worry about concurrent downloads, write a deterministic test that asserts tempfile chooses TMPDIR with N parallel `tempfile.TemporaryDirectory()` calls in threads. - **Strengthen step 4 of the Test plan:** instead of `ls -ldn`, write a one-line bash assertion that exits non-zero on the wrong owner, so it works as a CI step too: ```bash docker exec archive-ocr test "$(stat -c %u /app/cache/.tmp)" = "1000" || exit 1 ``` - **Test pyramid placement:** these are unit tests (<10s). They belong at the unit layer, not in Playwright. No E2E impact. ### Open Decisions - _(none — the AC need one more line for the automated regression test; the rest of the plan is sound)_

marcel commented

2026-05-18 10:43:06 +02:00

📋 "Elicit" — Requirements Engineer

Observations

This is a brownfield bug ticket with an unusually well-structured spec — root cause, three numbered approaches with explicit pros/cons, AC, test plan, files-to-touch, out-of-scope. Most of my Definition-of-Ready boxes are pre-checked.
One ambiguity in scope: the "Out of scope" section lists three items, but two of them (the missing german_kurrent.mlmodel, and the volume-ownership automation) are mentioned in the AC's context (#6: "the chosen approach still works on a fresh ocr_cache volume"). The AC implicitly depends on the volume being writable by uid 1000 — which is currently solved only by the manual chown. Without the volume-bootstrap automation in scope, AC #6 is satisfiable only on environments where someone has already chowned. That is a hidden dependency.
The AC is missing two NFR-flavoured items: (a) no automated regression test is required (Sara has flagged this), (b) no observability requirement is named (does first-request download time show up anywhere? a long stall on staging without telemetry is exactly what hid this bug for some hours).
"Out of scope: ADR or commit comment ... author's call; not required for merge" contradicts CLAUDE.md / Markus's persona: ADRs are required for architectural decisions with lasting consequences. TMPDIR redirected to a persistent volume mount is exactly that. Either the ADR is in scope or CLAUDE.md needs updating — but the issue should not be the place that quietly waives the doc requirement.

Recommendations

Promote two items from "Out of scope" to acceptance criteria (or to explicit blocking sub-issues linked from this one):
1. Volume-ownership automation — without it, AC #6 is theoretical. Either fix it here or link an issue and make that issue a blocker.
2. ADR-008 for the TMPDIR convention — required per project doc rules, not optional.
Add an explicit "out of scope: code changes to Surya's surya/common/s3.py" so a reviewer doesn't ask why we didn't patch upstream. The reason (we can't fork OCR libraries for hosting tweaks) is sound; just name it.
Tighten AC into Given-When-Then form for the critical path — this issue is procedural enough to deserve it:
- Given a fresh ocr_cache volume owned by uid 1000, when the ocr-service container starts and receives a guided-OCR request for a Surya model not previously cached, then the model downloads to /app/cache/datalab/models/text_recognition/2025_09_23/ and inference returns text within the existing healthcheck start_period.
- Given an existing ocr_cache volume with stale 0:0 ownership (pre-#612 state), when... — covers AC #6.
- Given a successful guided-OCR run, when /train-sender is invoked with a 20-image ZIP, then training completes without ENOSPC and the cached Surya model is preserved.
Add an explicit performance NFR: "First Surya download (~1.34 GB, ~10s on home connection) completes within 60s; subsequent guided-OCR requests use the cached model and do not re-download." This converts the test-plan note into a measurable requirement.
Add an observability NFR: when the OCR service writes >100 MB to TMPDIR, log it. Today this failure was diagnosed by reading stack traces in container logs — a structured logger.info("Surya model staging to %s", tmpdir) would have spotted the pre-existing-volume issue weeks ago. Lightweight, costs nothing.
Definition-of-Ready check on the issue itself: 7/8 pass. Missing item: explicit T-shirt size — but per project rules (solo workflow memory: "skip T-shirt sizes"), that's intentional. No change needed.

Open Decisions

Whether the volume-ownership automation blocks this PR or is a separate-but-prioritized issue. Cost of blocking: this PR cannot land until both are done (~1 day extra). Cost of separating: AC #6 is verified only by a single manual docker volume rm rather than a permanent CI guarantee. (Raised by Elicit, also implicit in Markus's and Tobias's comments.)

## 📋 "Elicit" — Requirements Engineer ### Observations - This is a brownfield bug ticket with an unusually well-structured spec — root cause, three numbered approaches with explicit pros/cons, AC, test plan, files-to-touch, out-of-scope. Most of my Definition-of-Ready boxes are pre-checked. - One ambiguity in scope: the "Out of scope" section lists three items, but two of them (the missing `german_kurrent.mlmodel`, and the volume-ownership automation) are mentioned in the AC's context (#6: "the chosen approach still works on a fresh `ocr_cache` volume"). The AC implicitly depends on the volume being writable by uid 1000 — which is currently solved only by the manual chown. Without the volume-bootstrap automation in scope, AC #6 is satisfiable only on environments where someone has already chowned. That is a hidden dependency. - The AC is missing two NFR-flavoured items: (a) no automated regression test is required (Sara has flagged this), (b) no observability requirement is named (does first-request download time show up anywhere? a long stall on staging without telemetry is exactly what hid this bug for some hours). - "Out of scope: ADR or commit comment ... author's call; not required for merge" contradicts CLAUDE.md / Markus's persona: ADRs are required for architectural decisions with lasting consequences. `TMPDIR` redirected to a persistent volume mount is exactly that. Either the ADR is in scope or CLAUDE.md needs updating — but the issue should not be the place that quietly waives the doc requirement. ### Recommendations - **Promote two items from "Out of scope" to acceptance criteria** (or to explicit blocking sub-issues linked from this one): 1. *Volume-ownership automation* — without it, AC #6 is theoretical. Either fix it here or link an issue and make that issue a blocker. 2. *ADR-008 for the `TMPDIR` convention* — required per project doc rules, not optional. - **Add an explicit "out of scope: code changes to Surya's `surya/common/s3.py`"** so a reviewer doesn't ask why we didn't patch upstream. The reason (we can't fork OCR libraries for hosting tweaks) is sound; just name it. - **Tighten AC into Given-When-Then form for the critical path** — this issue is procedural enough to deserve it: - Given a fresh `ocr_cache` volume owned by uid 1000, when the `ocr-service` container starts and receives a guided-OCR request for a Surya model not previously cached, then the model downloads to `/app/cache/datalab/models/text_recognition/2025_09_23/` and inference returns text within the existing healthcheck `start_period`. - Given an existing `ocr_cache` volume with stale `0:0` ownership (pre-#612 state), when... — covers AC #6. - Given a successful guided-OCR run, when `/train-sender` is invoked with a 20-image ZIP, then training completes without ENOSPC and the cached Surya model is preserved. - **Add an explicit performance NFR:** "First Surya download (~1.34 GB, ~10s on home connection) completes within 60s; subsequent guided-OCR requests use the cached model and do not re-download." This converts the test-plan note into a measurable requirement. - **Add an observability NFR:** when the OCR service writes >100 MB to `TMPDIR`, log it. Today this failure was diagnosed by reading stack traces in container logs — a structured `logger.info("Surya model staging to %s", tmpdir)` would have spotted the pre-existing-volume issue weeks ago. Lightweight, costs nothing. - **Definition-of-Ready check on the issue itself:** 7/8 pass. Missing item: explicit T-shirt size — but per project rules (solo workflow memory: "skip T-shirt sizes"), that's intentional. No change needed. ### Open Decisions - **Whether the volume-ownership automation blocks this PR or is a separate-but-prioritized issue.** Cost of blocking: this PR cannot land until both are done (~1 day extra). Cost of separating: AC #6 is verified only by a single manual `docker volume rm` rather than a permanent CI guarantee. _(Raised by Elicit, also implicit in Markus's and Tobias's comments.)_

marcel commented

2026-05-18 10:43:13 +02:00

🎨 Leonie Voss — UX & Accessibility Lead

No concerns from my angle. This is an infrastructure bug in the OCR Python service; the user-visible failure mode is "guided OCR returns an error in the UI." I checked: the transcribe panel surfaces backend errors via the existing getErrorMessage() mapping, so the symptom appears as a generic localized OCR-failed message in messages/{de,en,es}.json — no UI change required by this fix.

The one thing worth naming for future reference: a 1.34 GB first-time download takes ~10–30 seconds on a typical connection (the issue notes "~10s on home connection"). If the transcribe panel today shows a generic spinner during that window, the user perceives "the app is stuck." Once this bug is fixed, please check whether the loading state on the first OCR-after-deploy still feels acceptable on a 60+ user's laptop. If it doesn't, that's a separate UX ticket about progressive loading messaging, not part of this fix.

## 🎨 Leonie Voss — UX & Accessibility Lead No concerns from my angle. This is an infrastructure bug in the OCR Python service; the user-visible failure mode is "guided OCR returns an error in the UI." I checked: the transcribe panel surfaces backend errors via the existing `getErrorMessage()` mapping, so the symptom appears as a generic localized OCR-failed message in `messages/{de,en,es}.json` — no UI change required by this fix. The one thing worth naming for future reference: a 1.34 GB first-time download takes ~10–30 seconds on a typical connection (the issue notes "~10s on home connection"). If the transcribe panel today shows a generic spinner during that window, the user perceives "the app is stuck." Once this bug is fixed, please check whether the loading state on the first OCR-after-deploy still feels acceptable on a 60+ user's laptop. If it doesn't, that's a separate UX ticket about progressive loading messaging, not part of this fix.

marcel commented

2026-05-18 10:43:27 +02:00

🗳️ Decision Queue — Action Required

1 decision needs your input before implementation starts.

Infrastructure / Scope

Volume-ownership automation: block this PR, or split into a follow-up?
AC #6 ("On a fresh ocr_cache volume the chosen approach still works") is only honestly satisfiable once the manual chown -R 1000:1000 from 2026-05-18 is replaced by an init-container (or equivalent) in docker-compose.prod.yml. Three options:
- A. Block this PR. Add the ocr-volume-init service (Tobias's snippet) to both compose files in the same PR. ~1 extra day of work, but AC #6 becomes a permanent CI-verifiable guarantee instead of "someone chowned once."
- B. Split into a separate, immediately-prioritized issue. Land Approach A's TMPDIR fix now (resolves staging today), open devops(ocr): automate ocr_cache + ocr_models volume ownership as P1, link both ways. Risk: AC #6 stays theoretical until issue 2 lands.
- C. Defer indefinitely. Document the manual chown as a known step. Not recommended — undoes the post-mortem lessons of this very ticket.
  (Raised by: Elicit, with concurring observations from Markus and Tobias.)

All seven personas (Markus, Felix, Tobias, Nora, Sara, Elicit, Leonie) converged independently on Approach A for the core fix. That part of the design is settled — only this scoping question remains.

## 🗳️ Decision Queue — Action Required _1 decision needs your input before implementation starts._ ### Infrastructure / Scope - **Volume-ownership automation: block this PR, or split into a follow-up?** AC #6 ("On a fresh `ocr_cache` volume the chosen approach still works") is only honestly satisfiable once the manual `chown -R 1000:1000` from 2026-05-18 is replaced by an init-container (or equivalent) in `docker-compose.prod.yml`. Three options: - **A. Block this PR.** Add the `ocr-volume-init` service (Tobias's snippet) to both compose files in the same PR. ~1 extra day of work, but AC #6 becomes a permanent CI-verifiable guarantee instead of "someone chowned once." - **B. Split into a separate, immediately-prioritized issue.** Land Approach A's `TMPDIR` fix now (resolves staging today), open `devops(ocr): automate ocr_cache + ocr_models volume ownership` as P1, link both ways. Risk: AC #6 stays theoretical until issue 2 lands. - **C. Defer indefinitely.** Document the manual chown as a known step. Not recommended — undoes the post-mortem lessons of this very ticket. _(Raised by: Elicit, with concurring observations from Markus and Tobias.)_ All seven personas (Markus, Felix, Tobias, Nora, Sara, Elicit, Leonie) converged independently on **Approach A** for the core fix. That part of the design is settled — only this scoping question remains.

marcel commented

2026-05-18 10:45:09 +02:00

A

marcel referenced this issue from a commit

2026-05-18 10:59:04 +02:00

build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging

marcel referenced this issue from a commit

2026-05-18 10:59:04 +02:00

fix(ocr): create TMPDIR on startup and clear day-old orphans

marcel referenced this issue from a commit

2026-05-18 10:59:04 +02:00

fix(ocr): add TMPDIR env var and ocr-volume-init service to compose files

marcel referenced this issue from a commit

2026-05-18 10:59:04 +02:00

docs(ocr): document TMPDIR convention and add ADR-021

marcel commented

2026-05-18 10:59:32 +02:00

Implementation complete — branch `feat/issue-614-tmpdir-persistent-volume`

Approach A implemented. All acceptance criteria addressed. PR forthcoming.

Commits

Commit	Description
`09a04343`	`build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging`
`240b373f`	`fix(ocr): create TMPDIR on startup and clear day-old orphans`
`1f7b08b7`	`fix(ocr): add TMPDIR env var and ocr-volume-init service to compose files`
`cfd49ff6`	`docs(ocr): document TMPDIR convention and add ADR-021`

What was implemented

ocr-service/Dockerfile — ENV TMPDIR=/app/cache/.tmp ensures that even bare docker run (without compose) uses SSD staging. Without this, any tool reading TMPDIR before the compose env layer is applied would fall back to /tmp.

ocr-service/entrypoint.sh — Two lines added before ensure_blla_model.py:

mkdir -p "${TMPDIR:-/tmp}" — creates /app/cache/.tmp on fresh volumes (idempotent on existing ones)
find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete — clears orphaned fragments from prior docker kill during model downloads (Nora's cross-job leakage concern)

docker-compose.yml and docker-compose.prod.yml — Both files updated in the same commit to prevent the silent drift that already existed:

TMPDIR: /app/cache/.tmp with threat-model comment added to ocr-service.environment
/tmp tmpfs comment updated to reflect its revised purpose (training ZIPs + PDF buffers only)
ocr-volume-init one-shot service added: runs chown -R 1000:1000 + mkdir -p /app/cache/.tmp before ocr-service starts. Replaces the manual docker run --rm alpine chown from 2026-05-18 with a permanent infrastructure-as-code guarantee. Addresses AC #6 (fresh volume works) and the volume-bootstrap concern from Markus/Tobias/Elicit (Decision A — same PR).
ocr-service.depends_on added with condition: service_completed_successfully

docs/adr/021-tmpdir-persistent-volume-staging.md — New ADR documenting the decision, consequences, and rejected alternatives (Approach B/C). Per Markus's recommendation.

ocr-service/README.md — HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows added to the environment variables table.

ocr-service/CLAUDE.md — LLM reminder added: TMPDIR must stay on the cache volume; references ADR-021.

ocr-service/test_tmpdir.py — Three tests:

test_tempfile_uses_tmpdir_when_set — proves Python honours TMPDIR; monkeypatch-based, runs in CI
test_entrypoint_creates_tmpdir — TDD regression test: was RED before the entrypoint change, GREEN after. Runs entrypoint.sh with stub python3/uvicorn, asserts the directory gets created.
test_tmpdir_is_inside_persistent_cache_volume — guards against accidental reversion; skipped outside Docker, runs inside the container where TMPDIR=/app/cache/.tmp is set.

.gitea/workflows/ci.yml — test_tmpdir.py added to the OCR CI run (stdlib-only, no ML stack required).

AC verification

Criterion	Status
Guided OCR succeeds after fresh container start (Surya downloads to `/app/cache`)	✅ TMPDIR routes staging to SSD; path exists via volume-init + entrypoint mkdir
Training endpoint (`/train-sender` 20–50 image ZIPs) still works	✅ Training ZIPs still unpack to `/tmp` (512 MB, RAM-fast); TMPDIR redirect only affects model staging
`read_only: true` retained	✅ unchanged in both compose files
`cap_drop: [ALL]` retained	✅ unchanged
Non-root `ocr` user (uid 1000) retained	✅ `ocr-volume-init` sets ownership to 1000:1000 before service starts
Both compose files in sync	✅ same commit, same change in both
Fresh `ocr_cache` volume works (AC #6)	✅ `ocr-volume-init` creates `chown + mkdir`; entrypoint `mkdir -p` as second layer

## Implementation complete — branch `feat/issue-614-tmpdir-persistent-volume` Approach A implemented. All acceptance criteria addressed. PR forthcoming. ### Commits | Commit | Description | |---|---| | `09a04343` | `build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging` | | `240b373f` | `fix(ocr): create TMPDIR on startup and clear day-old orphans` | | `1f7b08b7` | `fix(ocr): add TMPDIR env var and ocr-volume-init service to compose files` | | `cfd49ff6` | `docs(ocr): document TMPDIR convention and add ADR-021` | ### What was implemented **`ocr-service/Dockerfile`** — `ENV TMPDIR=/app/cache/.tmp` ensures that even bare `docker run` (without compose) uses SSD staging. Without this, any tool reading `TMPDIR` before the compose env layer is applied would fall back to `/tmp`. **`ocr-service/entrypoint.sh`** — Two lines added before `ensure_blla_model.py`: - `mkdir -p "${TMPDIR:-/tmp}"` — creates `/app/cache/.tmp` on fresh volumes (idempotent on existing ones) - `find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete` — clears orphaned fragments from prior `docker kill` during model downloads (Nora's cross-job leakage concern) **`docker-compose.yml` and `docker-compose.prod.yml`** — Both files updated in the same commit to prevent the silent drift that already existed: - `TMPDIR: /app/cache/.tmp` with threat-model comment added to `ocr-service.environment` - `/tmp` tmpfs comment updated to reflect its revised purpose (training ZIPs + PDF buffers only) - `ocr-volume-init` one-shot service added: runs `chown -R 1000:1000` + `mkdir -p /app/cache/.tmp` before `ocr-service` starts. Replaces the manual `docker run --rm alpine chown` from 2026-05-18 with a permanent infrastructure-as-code guarantee. Addresses AC #6 (fresh volume works) and the volume-bootstrap concern from Markus/Tobias/Elicit (Decision A — same PR). - `ocr-service.depends_on` added with `condition: service_completed_successfully` **`docs/adr/021-tmpdir-persistent-volume-staging.md`** — New ADR documenting the decision, consequences, and rejected alternatives (Approach B/C). Per Markus's recommendation. **`ocr-service/README.md`** — `HF_HOME`, `XDG_CACHE_HOME`, `TORCH_HOME`, `TMPDIR` rows added to the environment variables table. **`ocr-service/CLAUDE.md`** — LLM reminder added: TMPDIR must stay on the cache volume; references ADR-021. **`ocr-service/test_tmpdir.py`** — Three tests: 1. `test_tempfile_uses_tmpdir_when_set` — proves Python honours `TMPDIR`; monkeypatch-based, runs in CI 2. `test_entrypoint_creates_tmpdir` — TDD regression test: was **RED** before the entrypoint change, **GREEN** after. Runs entrypoint.sh with stub python3/uvicorn, asserts the directory gets created. 3. `test_tmpdir_is_inside_persistent_cache_volume` — guards against accidental reversion; skipped outside Docker, runs inside the container where `TMPDIR=/app/cache/.tmp` is set. **`.gitea/workflows/ci.yml`** — `test_tmpdir.py` added to the OCR CI run (stdlib-only, no ML stack required). ### AC verification | Criterion | Status | |---|---| | Guided OCR succeeds after fresh container start (Surya downloads to `/app/cache`) | ✅ TMPDIR routes staging to SSD; path exists via volume-init + entrypoint mkdir | | Training endpoint (`/train-sender` 20–50 image ZIPs) still works | ✅ Training ZIPs still unpack to `/tmp` (512 MB, RAM-fast); TMPDIR redirect only affects model staging | | `read_only: true` retained | ✅ unchanged in both compose files | | `cap_drop: [ALL]` retained | ✅ unchanged | | Non-root `ocr` user (uid 1000) retained | ✅ `ocr-volume-init` sets ownership to 1000:1000 before service starts | | Both compose files in sync | ✅ same commit, same change in both | | Fresh `ocr_cache` volume works (AC #6) | ✅ `ocr-volume-init` creates `chown + mkdir`; entrypoint `mkdir -p` as second layer |

marcel referenced a pull request that will close this issue

2026-05-18 10:59:45 +02:00

fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615

marcel referenced this issue

2026-05-18 11:06:54 +02:00

fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615

marcel referenced this issue

2026-05-18 11:28:44 +02:00

fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615

marcel closed this issue

2026-05-18 11:32:37 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#614

bug(ocr): /tmp tmpfs too small for Surya model download — guided OCR fails with ENOSPC on staging #614

Context

Symptom

Root cause

Approaches under consideration

Approach A — Redirect TMPDIR to the disk-backed cache volume (recommended)

Approach B — Enlarge /tmp tmpfs to 4 GB

Approach C — Apply both

Acceptance criteria

Test plan

Files to touch

Out of scope (separate tickets)

Related work / history

🏛️ Markus Keller — Senior Application Architect

Observations

Recommendations

Open Decisions

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

Recommendations

Open Decisions

🛠️ Tobias Wendt — DevOps & Platform Engineer

Observations

Recommendations

Open Decisions

🔐 Nora "NullX" Steiner — Application Security Engineer

Observations

Recommendations

Open Decisions

🧪 Sara Holt — Senior QA Engineer

Observations

Recommendations

Open Decisions

📋 "Elicit" — Requirements Engineer

Observations

Recommendations

Open Decisions

🎨 Leonie Voss — UX & Accessibility Lead

🗳️ Decision Queue — Action Required

Infrastructure / Scope

Implementation complete — branch feat/issue-614-tmpdir-persistent-volume

Commits

What was implemented

AC verification

Approach A — Redirect `TMPDIR` to the disk-backed cache volume (recommended)

Approach B — Enlarge `/tmp` tmpfs to 4 GB

Implementation complete — branch `feat/issue-614-tmpdir-persistent-volume`