fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615

Merged
marcel merged 10 commits from feat/issue-614-tmpdir-persistent-volume into main 2026-05-18 11:32:37 +02:00
Owner

Summary

Fixes the ENOSPC failure on staging where Surya's 1.34 GB text_recognition model exhausted the 512 MB /tmp tmpfs during download. Root cause: tempfile.TemporaryDirectory() defaults to /tmp; the tmpfs was sized for training ZIPs, not GB-scale model staging.

  • Set TMPDIR=/app/cache/.tmp in Dockerfile + both compose files — routes all model staging to the SSD-backed cache volume; /tmp keeps its 512 MB RAM budget for training ZIPs
  • Add ocr-volume-init one-shot service to both compose files — automating the manual chown -R 1000:1000 from 2026-05-18 and creating /app/cache/.tmp on fresh volumes (resolves AC #6)
  • Add mkdir -p "$TMPDIR" + day-old orphan cleanup to entrypoint.sh as a fallback layer
  • Write ADR-021 + update README env var table + add CLAUDE.md LLM reminder
  • Add test_tmpdir.py with a proper red/green TDD regression test for the entrypoint mkdir

Closes #614

Test plan

  • All CI OCR tests pass (test_tmpdir.py test_spell_check.py test_confidence.py test_sender_registry.py)
  • test_entrypoint_creates_tmpdir was confirmed RED before entrypoint change, GREEN after
  • docker compose config output contains TMPDIR: /app/cache/.tmp in both dev and prod
  • Staging dry-run: docker volume rm archiv-staging_ocr-cache → restart → trigger guided OCR → Surya model downloads fully, inference returns text
  • docker exec archiv-ocr stat -c %u /app/cache/.tmp returns 1000 (uid of ocr user)

🤖 Generated with Claude Code

## Summary Fixes the ENOSPC failure on staging where Surya's 1.34 GB `text_recognition` model exhausted the 512 MB `/tmp` tmpfs during download. Root cause: `tempfile.TemporaryDirectory()` defaults to `/tmp`; the tmpfs was sized for training ZIPs, not GB-scale model staging. - Set `TMPDIR=/app/cache/.tmp` in Dockerfile + both compose files — routes all model staging to the SSD-backed cache volume; `/tmp` keeps its 512 MB RAM budget for training ZIPs - Add `ocr-volume-init` one-shot service to both compose files — automating the manual `chown -R 1000:1000` from 2026-05-18 and creating `/app/cache/.tmp` on fresh volumes (resolves AC #6) - Add `mkdir -p "$TMPDIR"` + day-old orphan cleanup to `entrypoint.sh` as a fallback layer - Write ADR-021 + update README env var table + add CLAUDE.md LLM reminder - Add `test_tmpdir.py` with a proper red/green TDD regression test for the entrypoint mkdir Closes #614 ## Test plan - [ ] All CI OCR tests pass (`test_tmpdir.py test_spell_check.py test_confidence.py test_sender_registry.py`) - [ ] `test_entrypoint_creates_tmpdir` was confirmed RED before entrypoint change, GREEN after - [ ] `docker compose config` output contains `TMPDIR: /app/cache/.tmp` in both dev and prod - [ ] Staging dry-run: `docker volume rm archiv-staging_ocr-cache` → restart → trigger guided OCR → Surya model downloads fully, inference returns text - [ ] `docker exec archiv-ocr stat -c %u /app/cache/.tmp` returns `1000` (uid of `ocr` user) 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel added 4 commits 2026-05-18 10:59:45 +02:00
Without this, running the image outside compose loses the TMPDIR redirect
and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC).
See issue #614, ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir
ensures the first Surya model download can proceed without ENOSPC on the
512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill
mid-download, preventing cross-job ground-truth leakage.

Fixes #614. See ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TMPDIR=/app/cache/.tmp routes Surya model staging to the SSD-backed cache
volume instead of the 512 MB /tmp tmpfs. The ocr-volume-init one-shot service
runs first to ensure correct ownership (uid 1000) and creates /app/cache/.tmp
on fresh volumes, making AC #6 ("fresh volume still works") a permanent
infrastructure-as-code guarantee rather than a manual chown step.

Both docker-compose.yml and docker-compose.prod.yml are updated in the same
commit to prevent the silent drift that occurred with the 512 MB tmpfs comment.

Fixes #614. See ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(ocr): document TMPDIR convention and add ADR-021
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m7s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
cfd49ff69e
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

🏗️ Markus Keller (@mkeller) — Senior Application Architect

Verdict: ⚠️ Approved with concerns


Blockers

New Docker service missing from documentation

ocr-volume-init is a new service in both compose files. Per our doc-update rules, this triggers:

  • docs/architecture/c4/l2-containers.puml — init container should appear
  • docs/DEPLOYMENT.md — needs a note about the bootstrap step and how to verify it succeeded (docker logs archiv-ocr-volume-init)

ADR-021 covers the why — the C4 container diagram and deployment doc cover the what, and they're absent.

Counter-argument: ocr-volume-init is a one-shot infrastructure step, not a software system with a port or API surface — closer to a docker run call than an application container. If that framing is accepted, a single sentence in DEPLOYMENT.md ("volume ownership is bootstrapped by ocr-volume-init, see ADR-021") would be the minimum, and the C4 update could be waived. Decide and document the reasoning — don't leave the ambiguity open.


What's done well

  • ADR-021 is exactly right: context, decision, two alternatives explicitly rejected with quantified reasoning (OOMKill risk at 6g mem_limit, same-filesystem rename(2) benefit). This is the load-bearing reference document this change needs.
  • Defense-in-depth layering is architecturally sound: ocr-volume-init creates /app/cache/.tmp, entrypoint.sh creates it again as a fallback. Neither layer depends solely on the other — a bare docker run still works.
  • depends_on: condition: service_completed_successfully is the correct dependency declaration for a one-shot init pattern.
  • The CLAUDE.md LLM reminder is a good architectural guardrail against future regressions reverting this change.
## 🏗️ Markus Keller (@mkeller) — Senior Application Architect **Verdict: ⚠️ Approved with concerns** --- ### Blockers **New Docker service missing from documentation** `ocr-volume-init` is a new service in both compose files. Per our doc-update rules, this triggers: - `docs/architecture/c4/l2-containers.puml` — init container should appear - `docs/DEPLOYMENT.md` — needs a note about the bootstrap step and how to verify it succeeded (`docker logs archiv-ocr-volume-init`) ADR-021 covers the *why* — the C4 container diagram and deployment doc cover the *what*, and they're absent. **Counter-argument**: `ocr-volume-init` is a one-shot infrastructure step, not a software system with a port or API surface — closer to a `docker run` call than an application container. If that framing is accepted, a single sentence in `DEPLOYMENT.md` ("volume ownership is bootstrapped by `ocr-volume-init`, see ADR-021") would be the minimum, and the C4 update could be waived. Decide and document the reasoning — don't leave the ambiguity open. --- ### What's done well - **ADR-021 is exactly right**: context, decision, two alternatives explicitly rejected with quantified reasoning (OOMKill risk at 6g `mem_limit`, same-filesystem `rename(2)` benefit). This is the load-bearing reference document this change needs. - **Defense-in-depth layering is architecturally sound**: `ocr-volume-init` creates `/app/cache/.tmp`, `entrypoint.sh` creates it again as a fallback. Neither layer depends solely on the other — a bare `docker run` still works. - `depends_on: condition: service_completed_successfully` is the correct dependency declaration for a one-shot init pattern. - The CLAUDE.md LLM reminder is a good architectural guardrail against future regressions reverting this change.
Author
Owner

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: Approved


Suggestions (non-blocking)

entrypoint.sh: misleading comment

# Orphaned fragments from prior docker-kill during model downloads are cleared
# on startup to prevent cross-job ground-truth leakage (Surya staging files).

"Cross-job ground-truth leakage" is NLP/ML terminology for labeled training data bleeding between runs. The orphaned files here are binary model staging blobs, not ground-truth labels. A cleaner phrasing: # Remove stale partial downloads left by a previous docker-kill. Minor, but comments should reveal intent precisely.

test_entrypoint_creates_tmpdir: exit code not asserted

result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True)
assert custom_tmp.exists(), ...
# ← result.returncode not checked

If the entrypoint exits non-zero after creating TMPDIR, the test still passes. Given the python3 and uvicorn stubs both exit 0, this won't bite today — but a future entrypoint change could introduce a silent regression. Add:

assert result.returncode == 0, (
    f"entrypoint.sh exited {result.returncode}\n"
    f"stdout: {result.stdout}\nstderr: {result.stderr}"
)

find -mtime +1 -delete behavior has no test

The orphan cleanup is exercised by the entrypoint test, but the behavior — old files deleted, new files preserved — has no dedicated test. os.utime() can fabricate an old mtime without mocking time; this is a stdlib-only test. Not a blocker, but this is the core of the cleanup feature.


What's done well

  • The stub binary approach in test_entrypoint_creates_tmpdir (fake python3 + uvicorn) is clean and keeps the test out of the ML stack dependency.
  • ${TMPDIR:-/tmp} fallback correctly handles the bare docker run case where the env var may not be set.
  • test_zipslip_still_anchors_under_custom_tmpdir is exactly the right regression test: verifying that relocating the extraction directory doesn't break os.path.realpath() anchoring. Testing the security-adjacent consequence of an infrastructure change is good practice.
  • test_tempfile_uses_tmpdir_when_set documents the Python mechanism the entire fix relies on — if Python ever changed this behavior, the test catches it before production.
## 👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer **Verdict: ✅ Approved** --- ### Suggestions (non-blocking) **`entrypoint.sh`: misleading comment** ```bash # Orphaned fragments from prior docker-kill during model downloads are cleared # on startup to prevent cross-job ground-truth leakage (Surya staging files). ``` "Cross-job ground-truth leakage" is NLP/ML terminology for labeled training data bleeding between runs. The orphaned files here are binary model staging blobs, not ground-truth labels. A cleaner phrasing: `# Remove stale partial downloads left by a previous docker-kill`. Minor, but comments should reveal intent precisely. **`test_entrypoint_creates_tmpdir`: exit code not asserted** ```python result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True) assert custom_tmp.exists(), ... # ← result.returncode not checked ``` If the entrypoint exits non-zero after creating TMPDIR, the test still passes. Given the `python3` and `uvicorn` stubs both `exit 0`, this won't bite today — but a future entrypoint change could introduce a silent regression. Add: ```python assert result.returncode == 0, ( f"entrypoint.sh exited {result.returncode}\n" f"stdout: {result.stdout}\nstderr: {result.stderr}" ) ``` **`find -mtime +1 -delete` behavior has no test** The orphan cleanup is exercised by the entrypoint test, but the *behavior* — old files deleted, new files preserved — has no dedicated test. `os.utime()` can fabricate an old mtime without mocking time; this is a stdlib-only test. Not a blocker, but this is the core of the cleanup feature. --- ### What's done well - The stub binary approach in `test_entrypoint_creates_tmpdir` (fake `python3` + `uvicorn`) is clean and keeps the test out of the ML stack dependency. - `${TMPDIR:-/tmp}` fallback correctly handles the bare `docker run` case where the env var may not be set. - `test_zipslip_still_anchors_under_custom_tmpdir` is exactly the right regression test: verifying that relocating the extraction directory doesn't break `os.path.realpath()` anchoring. Testing the security-adjacent consequence of an infrastructure change is good practice. - `test_tempfile_uses_tmpdir_when_set` documents the Python mechanism the entire fix relies on — if Python ever changed this behavior, the test catches it before production.
Author
Owner

🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns


Blockers

alpine:3 is unpinned — both compose files

ocr-volume-init:
    image: alpine:3   # ← unpinned

alpine:3 is a moving pointer — currently resolves to 3.21 but bumps with every Alpine minor release. A docker compose pull in 6 months pulls alpine:3.22 with potentially different busybox behavior. Builds are non-reproducible and rollbacks are impossible. Fix:

image: alpine:3.21   # or alpine:3.21.3 for full reproducibility

Add a Renovate rule to auto-bump the patch version:

{ "matchPackageNames": ["alpine"], "enabled": true }

DEPLOYMENT.md not updated

ocr-volume-init is a new deployment step operators need to understand. When deploying to staging or production on a fresh host, they should know:

  • What ocr-volume-init does and why it runs first
  • How to verify it succeeded: docker logs archiv-ocr-volume-init
  • What failure looks like (non-zero exit, chown denied on a read-only volume)

One paragraph in DEPLOYMENT.md is enough; ADR-021 is the detail reference.


Concerns (non-blocking)

Init container joins archiv-net unnecessarily

The init container only needs volume access, not network access. By default, Compose adds it to the project network. Explicitly exclude it:

ocr-volume-init:
    image: alpine:3.21
    networks: []     # volumes only — no network needed
    ...

Least-privilege at the container level costs nothing and reduces the network's exposed surface by one container.

Duplication between compose files

The ocr-volume-init block is copy-pasted between docker-compose.yml (volume names: ocr_models, ocr_cache) and docker-compose.prod.yml (ocr-models, ocr-cache). If the init command changes, it requires two identical edits. Acceptable given the existing overlay structure, but worth noting for the next time this block is touched.


What's done well

  • restart: "no" is correct — init containers must not restart. A failed chown looping forever would be worse than a failed start.
  • depends_on: condition: service_completed_successfully is the right gate; service_started would allow a race.
  • The find -mtime +1 -delete 2>/dev/null || true pattern is safe: -mindepth 1 prevents deleting TMPDIR itself, 2>/dev/null handles a missing directory gracefully, || true prevents set -e from aborting the entrypoint.
  • ci.yml correctly appends test_tmpdir.py to the lightweight test command — no new dependencies, no ML stack required.
## 🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** --- ### Blockers **`alpine:3` is unpinned — both compose files** ```yaml ocr-volume-init: image: alpine:3 # ← unpinned ``` `alpine:3` is a moving pointer — currently resolves to `3.21` but bumps with every Alpine minor release. A `docker compose pull` in 6 months pulls `alpine:3.22` with potentially different busybox behavior. Builds are non-reproducible and rollbacks are impossible. Fix: ```yaml image: alpine:3.21 # or alpine:3.21.3 for full reproducibility ``` Add a Renovate rule to auto-bump the patch version: ```json { "matchPackageNames": ["alpine"], "enabled": true } ``` **`DEPLOYMENT.md` not updated** `ocr-volume-init` is a new deployment step operators need to understand. When deploying to staging or production on a fresh host, they should know: - What `ocr-volume-init` does and why it runs first - How to verify it succeeded: `docker logs archiv-ocr-volume-init` - What failure looks like (non-zero exit, `chown` denied on a read-only volume) One paragraph in `DEPLOYMENT.md` is enough; ADR-021 is the detail reference. --- ### Concerns (non-blocking) **Init container joins `archiv-net` unnecessarily** The init container only needs volume access, not network access. By default, Compose adds it to the project network. Explicitly exclude it: ```yaml ocr-volume-init: image: alpine:3.21 networks: [] # volumes only — no network needed ... ``` Least-privilege at the container level costs nothing and reduces the network's exposed surface by one container. **Duplication between compose files** The `ocr-volume-init` block is copy-pasted between `docker-compose.yml` (volume names: `ocr_models`, `ocr_cache`) and `docker-compose.prod.yml` (`ocr-models`, `ocr-cache`). If the init command changes, it requires two identical edits. Acceptable given the existing overlay structure, but worth noting for the next time this block is touched. --- ### What's done well - `restart: "no"` is correct — init containers must not restart. A failed `chown` looping forever would be worse than a failed start. - `depends_on: condition: service_completed_successfully` is the right gate; `service_started` would allow a race. - The `find -mtime +1 -delete 2>/dev/null || true` pattern is safe: `-mindepth 1` prevents deleting TMPDIR itself, `2>/dev/null` handles a missing directory gracefully, `|| true` prevents `set -e` from aborting the entrypoint. - `ci.yml` correctly appends `test_tmpdir.py` to the lightweight test command — no new dependencies, no ML stack required.
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved


What I checked

  • TMPDIR injection risk: TMPDIR is hard-coded in the Dockerfile and compose files (/app/cache/.tmp) — not user-controlled, no injection vector.
  • find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete: -mindepth 1 prevents deleting TMPDIR root; runs as uid 1000 (the ocr user), so it can only delete files that user owns; 2>/dev/null || true means a failed find cannot crash the container. No path traversal risk.
  • chown -R 1000:1000 /app/cache /app/models: Runs as root in a one-shot init container, on internal Docker volumes not reachable by external actors. Acceptable.
  • ZIP Slip: _validate_zip_entry() uses os.path.realpath() anchored to the extraction directory — relocating that directory to /app/cache/.tmp doesn't change the anchor logic. Correctly documented in ADR-021.

Concerns (non-blocking)

Security regression test test_zipslip_still_anchors_under_custom_tmpdir skips in CI

@pytest.mark.skipif(not HAS_MAIN, reason="requires full ML stack (not available in CI)")
def test_zipslip_still_anchors_under_custom_tmpdir(tmp_path):

HAS_MAIN = False in CI because from main import _validate_zip_entry fails when the ML stack isn't installed. However, _validate_zip_entry is a pure path-manipulation function — it has no ML dependency itself. The ImportError is a transitive import through main.py's module-level ML imports.

Options (neither is a blocker):

  1. Extract _validate_zip_entry to a utils.py with no ML imports → test runs in CI, security regression is continuously enforced.
  2. Accept the skip, noting that _validate_zip_entry is already covered by other CI tests that pre-date this change.

Option 1 is the stronger guarantee. Option 2 is acceptable if the existing coverage is verified.

alpine:3 supply chain concern (shared with @tobiwendt)

Unpinned tags make the init container's exact behavior non-reproducible and introduce a low-probability supply chain risk. Pin to alpine:3.21.


What's done well

  • ADR-021 explicitly states that ZIP Slip protection is unaffected by the TMPDIR relocation. Documenting the non-impact of a security-adjacent change is good practice — it prevents a future auditor from treating this as a risk that was silently introduced.
  • The test_zipslip_still_anchors_under_custom_tmpdir test exists and correctly exercises the boundary case — the intent is right even if CI coverage is incomplete.
  • TMPDIR cannot be influenced by any attacker-reachable input path. The security posture of the OCR service is unchanged.
## 🔒 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** --- ### What I checked - **TMPDIR injection risk**: TMPDIR is hard-coded in the Dockerfile and compose files (`/app/cache/.tmp`) — not user-controlled, no injection vector. - **`find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete`**: `-mindepth 1` prevents deleting TMPDIR root; runs as uid `1000` (the `ocr` user), so it can only delete files that user owns; `2>/dev/null || true` means a failed find cannot crash the container. No path traversal risk. - **`chown -R 1000:1000 /app/cache /app/models`**: Runs as root in a one-shot init container, on internal Docker volumes not reachable by external actors. Acceptable. - **ZIP Slip**: `_validate_zip_entry()` uses `os.path.realpath()` anchored to the extraction directory — relocating that directory to `/app/cache/.tmp` doesn't change the anchor logic. Correctly documented in ADR-021. --- ### Concerns (non-blocking) **Security regression test `test_zipslip_still_anchors_under_custom_tmpdir` skips in CI** ```python @pytest.mark.skipif(not HAS_MAIN, reason="requires full ML stack (not available in CI)") def test_zipslip_still_anchors_under_custom_tmpdir(tmp_path): ``` `HAS_MAIN = False` in CI because `from main import _validate_zip_entry` fails when the ML stack isn't installed. However, `_validate_zip_entry` is a pure path-manipulation function — it has no ML dependency itself. The `ImportError` is a transitive import through `main.py`'s module-level ML imports. Options (neither is a blocker): 1. Extract `_validate_zip_entry` to a `utils.py` with no ML imports → test runs in CI, security regression is continuously enforced. 2. Accept the skip, noting that `_validate_zip_entry` is already covered by other CI tests that pre-date this change. Option 1 is the stronger guarantee. Option 2 is acceptable if the existing coverage is verified. **`alpine:3` supply chain concern** (shared with @tobiwendt) Unpinned tags make the init container's exact behavior non-reproducible and introduce a low-probability supply chain risk. Pin to `alpine:3.21`. --- ### What's done well - ADR-021 explicitly states that ZIP Slip protection is unaffected by the TMPDIR relocation. Documenting the non-impact of a security-adjacent change is good practice — it prevents a future auditor from treating this as a risk that was silently introduced. - The `test_zipslip_still_anchors_under_custom_tmpdir` test exists and correctly exercises the boundary case — the intent is right even if CI coverage is incomplete. - TMPDIR cannot be influenced by any attacker-reachable input path. The security posture of the OCR service is unchanged.
Author
Owner

🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns


Concerns (non-blocking)

Orphan cleanup behavior is untested

find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete is exercised by test_entrypoint_creates_tmpdir but its behavior is not tested:

  • Does it actually delete files older than 1 day?
  • Does it preserve files newer than 1 day?
  • Does it handle an empty TMPDIR gracefully?

This is the core logic of the cleanup feature. os.utime() can fabricate an old mtime without mocking system time — a stdlib-only test:

def test_entrypoint_removes_day_old_orphans(tmp_path):
    old_file = tmp_path / "stale_model.bin"
    old_file.write_bytes(b"partial")
    # backdate to 2 days ago
    old_mtime = time.time() - 2 * 24 * 3600
    os.utime(old_file, (old_mtime, old_mtime))

    run_entrypoint(tmp_path)  # helper wrapping the subprocess call

    assert not old_file.exists(), "day-old orphan should be deleted"

def test_entrypoint_preserves_fresh_files(tmp_path):
    fresh_file = tmp_path / "active_model.bin"
    fresh_file.write_bytes(b"in-progress")
    # no utime — file mtime is now

    run_entrypoint(tmp_path)

    assert fresh_file.exists(), "recent file should be preserved"

test_entrypoint_creates_tmpdir: exit code not asserted

result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True)
assert custom_tmp.exists(), ...
# result.returncode never checked

A test that passes even when the process exits with code 1 is a test that can mask regressions. First assertion should be:

assert result.returncode == 0, f"entrypoint.sh exited {result.returncode}\nstderr: {result.stderr}"

test_tmpdir_is_inside_persistent_cache_volume is permanently skipped in CI without documentation

The skip is intentional and correct (deployment contract test, not a unit test). But there's no comment explaining how this test is expected to run — e.g., inside the OCR container during a staging smoke test. Without that, a future developer may wonder if this test is dead code. A one-line comment like # run: docker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -v would close the loop.


TDD evidence

The PR description states: "test_entrypoint_creates_tmpdir was confirmed RED before entrypoint change, GREEN after." That's the discipline this project requires. ✓


What's done well

  • Three-test structure maps well to the test pyramid for an infrastructure change: (1) mechanism contract (test_tempfile_uses_tmpdir_when_set), (2) entrypoint behavior (test_entrypoint_creates_tmpdir), (3) deployment contract (test_tmpdir_is_inside_persistent_cache_volume).
  • @pytest.mark.skipif conditions are explicit and include the reason — no guesswork about why they skip.
  • test_tmpdir.py added to the CI test command in ci.yml closes the coverage gap immediately. ✓
  • Stub binary approach (fake python3 + uvicorn) keeps the test in the lightweight CI tier, no ML stack dependency.
## 🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** --- ### Concerns (non-blocking) **Orphan cleanup behavior is untested** `find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete` is exercised by `test_entrypoint_creates_tmpdir` but its *behavior* is not tested: - Does it actually delete files older than 1 day? - Does it preserve files newer than 1 day? - Does it handle an empty TMPDIR gracefully? This is the core logic of the cleanup feature. `os.utime()` can fabricate an old mtime without mocking system time — a stdlib-only test: ```python def test_entrypoint_removes_day_old_orphans(tmp_path): old_file = tmp_path / "stale_model.bin" old_file.write_bytes(b"partial") # backdate to 2 days ago old_mtime = time.time() - 2 * 24 * 3600 os.utime(old_file, (old_mtime, old_mtime)) run_entrypoint(tmp_path) # helper wrapping the subprocess call assert not old_file.exists(), "day-old orphan should be deleted" def test_entrypoint_preserves_fresh_files(tmp_path): fresh_file = tmp_path / "active_model.bin" fresh_file.write_bytes(b"in-progress") # no utime — file mtime is now run_entrypoint(tmp_path) assert fresh_file.exists(), "recent file should be preserved" ``` **`test_entrypoint_creates_tmpdir`: exit code not asserted** ```python result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True) assert custom_tmp.exists(), ... # result.returncode never checked ``` A test that passes even when the process exits with code 1 is a test that can mask regressions. First assertion should be: ```python assert result.returncode == 0, f"entrypoint.sh exited {result.returncode}\nstderr: {result.stderr}" ``` **`test_tmpdir_is_inside_persistent_cache_volume` is permanently skipped in CI without documentation** The skip is intentional and correct (deployment contract test, not a unit test). But there's no comment explaining *how* this test is expected to run — e.g., inside the OCR container during a staging smoke test. Without that, a future developer may wonder if this test is dead code. A one-line comment like `# run: docker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -v` would close the loop. --- ### TDD evidence The PR description states: *"test_entrypoint_creates_tmpdir was confirmed RED before entrypoint change, GREEN after."* That's the discipline this project requires. ✓ --- ### What's done well - Three-test structure maps well to the test pyramid for an infrastructure change: (1) mechanism contract (`test_tempfile_uses_tmpdir_when_set`), (2) entrypoint behavior (`test_entrypoint_creates_tmpdir`), (3) deployment contract (`test_tmpdir_is_inside_persistent_cache_volume`). - `@pytest.mark.skipif` conditions are explicit and include the reason — no guesswork about why they skip. - `test_tmpdir.py` added to the CI test command in `ci.yml` closes the coverage gap immediately. ✓ - Stub binary approach (fake `python3` + `uvicorn`) keeps the test in the lightweight CI tier, no ML stack dependency.
Author
Owner

📋 Elicit — Requirements Engineer

Verdict: Approved


Requirements coverage assessment

Issue #614 ACs fully addressed

The PR description maps directly to the acceptance criteria from #614. AC #6 ("creating /app/cache/.tmp on fresh volumes") is explicitly resolved via ocr-volume-init and the entrypoint fallback. All five test plan items are specific and testable — no vague "it works" language.

NFR completeness

The fix correctly preserves two competing NFRs:

  • Security NFR: /tmp retains its 512 MB DoS cap for attacker-influenceable training endpoints (post-auth only, but still intentional).
  • Operational NFR: Surya model downloads complete on available SSD storage.

ADR-021 documents both explicitly, including the threat model for why the 512 MB cap must stay.

Alternatives evaluated

Approach B (enlarge /tmp to 4 GB) is rejected with quantified reasoning: combined Surya resident set + tmpfs would trigger OOMKill on CX32 hosts with OCR_MEM_LIMIT=6g. Approach C (belt-and-suspenders) is rejected as unnecessary complexity. This is good requirements discipline — non-chosen alternatives with documented rejection reasons prevent the same debate from recurring.

Scope discipline

The PR is tightly scoped to the root cause (TMPDIR routing). No scope creep identified.


Minor observation (non-blocking)

The staging dry-run acceptance criterion — "docker volume rm archiv-staging_ocr-cache → restart → trigger guided OCR → Surya model downloads fully" — is a manual test. For a solo-developer workflow, documenting the expected observable outcome (specific log lines, or stat -c %u /app/cache/.tmp returning 1000) would make future regression verification faster and less ambiguous. The PR already includes docker exec archiv-ocr stat -c %u /app/cache/.tmp as AC #5 — extending that pattern to the download verification step would complete the picture.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** --- ### Requirements coverage assessment **Issue #614 ACs fully addressed** The PR description maps directly to the acceptance criteria from #614. AC #6 ("creating `/app/cache/.tmp` on fresh volumes") is explicitly resolved via `ocr-volume-init` and the entrypoint fallback. All five test plan items are specific and testable — no vague "it works" language. **NFR completeness** The fix correctly preserves two competing NFRs: - **Security NFR**: `/tmp` retains its 512 MB DoS cap for attacker-influenceable training endpoints (post-auth only, but still intentional). - **Operational NFR**: Surya model downloads complete on available SSD storage. ADR-021 documents both explicitly, including the threat model for why the 512 MB cap must stay. **Alternatives evaluated** Approach B (enlarge `/tmp` to 4 GB) is rejected with quantified reasoning: combined Surya resident set + tmpfs would trigger OOMKill on CX32 hosts with `OCR_MEM_LIMIT=6g`. Approach C (belt-and-suspenders) is rejected as unnecessary complexity. This is good requirements discipline — non-chosen alternatives with documented rejection reasons prevent the same debate from recurring. **Scope discipline** The PR is tightly scoped to the root cause (TMPDIR routing). No scope creep identified. --- ### Minor observation (non-blocking) The staging dry-run acceptance criterion — *"docker volume rm archiv-staging_ocr-cache → restart → trigger guided OCR → Surya model downloads fully"* — is a manual test. For a solo-developer workflow, documenting the expected observable outcome (specific log lines, or `stat -c %u /app/cache/.tmp` returning `1000`) would make future regression verification faster and less ambiguous. The PR already includes `docker exec archiv-ocr stat -c %u /app/cache/.tmp` as AC #5 — extending that pattern to the download verification step would complete the picture.
Author
Owner

🎨 Leonie Voss (@leonievoss) — UX Design Lead & Accessibility Strategist

Verdict: Approved

No UI, frontend, or accessibility changes in this PR. The changes are confined to:

  • OCR service infrastructure (Dockerfile, entrypoint.sh, Python tests)
  • Docker Compose configuration (docker-compose.yml, docker-compose.prod.yml)
  • Documentation (ADR-021, README env var table, CLAUDE.md LLM reminder)

Nothing to review from a UX or accessibility perspective.

The fix indirectly benefits the transcription experience — fewer ENOSPC failures during Surya model downloads means more reliable OCR results for the people transcribing letters — but no user-facing changes are introduced and no design decisions are affected.

## 🎨 Leonie Voss (@leonievoss) — UX Design Lead & Accessibility Strategist **Verdict: ✅ Approved** No UI, frontend, or accessibility changes in this PR. The changes are confined to: - OCR service infrastructure (Dockerfile, `entrypoint.sh`, Python tests) - Docker Compose configuration (`docker-compose.yml`, `docker-compose.prod.yml`) - Documentation (ADR-021, README env var table, CLAUDE.md LLM reminder) Nothing to review from a UX or accessibility perspective. The fix indirectly benefits the transcription experience — fewer `ENOSPC` failures during Surya model downloads means more reliable OCR results for the people transcribing letters — but no user-facing changes are introduced and no design decisions are affected.
marcel added 6 commits 2026-05-18 11:23:59 +02:00
_validate_zip_entry has no ML-stack dependency; importing it via main.py
pulled in surya/torch and caused the test to be skipped in CI. Moving it
to utils.py (fastapi only) and adding fastapi to the CI lightweight install
lets test_zipslip_still_anchors_under_custom_tmpdir run on every push.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A silent non-zero exit would previously cause the test to pass incorrectly
because only directory creation was checked. Exit code is now the first
assertion, catching regressions before the filesystem check runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files
verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes
without mocking system time. Also extracts _run_entrypoint helper to remove
subprocess setup duplication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- entrypoint.sh: replace "cross-job ground-truth leakage" with plain
  "Remove stale partial downloads left by a previous docker-kill"
- test_tmpdir_is_inside_persistent_cache_volume: add docker exec command
  so future developers know how to run this deployment-contract test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
alpine:3 is a moving tag — pinning to 3.21 makes builds reproducible and
rollbacks possible. networks: [] removes the init container from the project
network since it only needs volume access, not network access (least privilege).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs(deployment): document ocr-volume-init bootstrap service in §8 upgrade notes
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m1s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m0s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
CI / Unit & Component Tests (push) Successful in 3m5s
CI / OCR Service Tests (push) Successful in 19s
CI / Backend Unit Tests (push) Successful in 3m1s
CI / fail2ban Regex (push) Successful in 43s
CI / Semgrep Security Scan (push) Successful in 18s
CI / Compose Bucket Idempotency (push) Successful in 59s
193a4d6ee6
Explains what ocr-volume-init does (chown volumes + create TMPDIR), how to
verify it succeeded (docker logs), and what failure looks like. Addresses
reviewer concerns from @mkeller and @tobiwendt on PR #615.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Review concerns addressed

All open concerns from the review round resolved across 5 commits on this branch.


Blockers fixed

alpine:3 pinned to alpine:3.21 — both docker-compose.yml and docker-compose.prod.yml (3182da8d)

Addresses @tobiwendt + @nullx: unpinned tag was non-reproducible and a low-probability supply chain risk.

DEPLOYMENT.md updated — §8 Upgrade notes now documents ocr-volume-init: what it does (chown volumes + create TMPDIR), how to verify it succeeded (docker logs archiv-ocr-volume-init), and what failure looks like (chown denied → non-zero exit → ocr-service won't start) (193a4d6e)

Addresses @mkeller + @tobiwendt. C4 diagram update waived per the "init container = infrastructure step, not a software system with a port/API surface" framing — DEPLOYMENT.md is the load-bearing reference.


Non-blocking concerns resolved

_validate_zip_entry extracted to utils.py — no ML-stack imports; test_zipslip_still_anchors_under_custom_tmpdir now runs in CI without the skip (c2bd1b34)

Addresses @nullx Option 1. CI Install test dependencies step updated to include fastapi==0.115.6 (the only non-stdlib dependency of utils.py).

assert result.returncode == 0 added to test_entrypoint_creates_tmpdir — exit code is now the first assertion (e31dac5c)

Addresses @felixbrandt + @saraholt.

Orphan cleanup behavior testedtest_entrypoint_removes_day_old_orphans (fabricates 2-day-old mtime via os.utime(), asserts file deleted) and test_entrypoint_preserves_fresh_files (current mtime, asserts file survives) added (775b5c06)

Addresses @saraholt + @felixbrandt.

Manual run comment added to test_tmpdir_is_inside_persistent_cache_volumedocker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -v (6839cf2a)

Addresses @saraholt.

entrypoint.sh comment fixed — "cross-job ground-truth leakage" → "Remove stale partial downloads left by a previous docker-kill" (6839cf2a)

Addresses @felixbrandt.

networks: [] added to both ocr-volume-init definitions (3182da8d)

Addresses @tobiwendt: init container only needs volume access, not network access.


Final test run

47 passed, 1 skipped in 2.48s

(1 skip = test_tmpdir_is_inside_persistent_cache_volume, intentional — deployment-contract test that only runs inside the OCR container)

## Review concerns addressed All open concerns from the review round resolved across 5 commits on this branch. --- ### Blockers fixed **`alpine:3` pinned to `alpine:3.21`** — both `docker-compose.yml` and `docker-compose.prod.yml` (`3182da8d`) > Addresses @tobiwendt + @nullx: unpinned tag was non-reproducible and a low-probability supply chain risk. **`DEPLOYMENT.md` updated** — §8 Upgrade notes now documents `ocr-volume-init`: what it does (chown volumes + create TMPDIR), how to verify it succeeded (`docker logs archiv-ocr-volume-init`), and what failure looks like (`chown` denied → non-zero exit → `ocr-service` won't start) (`193a4d6e`) > Addresses @mkeller + @tobiwendt. C4 diagram update waived per the "init container = infrastructure step, not a software system with a port/API surface" framing — DEPLOYMENT.md is the load-bearing reference. --- ### Non-blocking concerns resolved **`_validate_zip_entry` extracted to `utils.py`** — no ML-stack imports; `test_zipslip_still_anchors_under_custom_tmpdir` now runs in CI without the skip (`c2bd1b34`) > Addresses @nullx Option 1. CI `Install test dependencies` step updated to include `fastapi==0.115.6` (the only non-stdlib dependency of `utils.py`). **`assert result.returncode == 0` added to `test_entrypoint_creates_tmpdir`** — exit code is now the first assertion (`e31dac5c`) > Addresses @felixbrandt + @saraholt. **Orphan cleanup behavior tested** — `test_entrypoint_removes_day_old_orphans` (fabricates 2-day-old mtime via `os.utime()`, asserts file deleted) and `test_entrypoint_preserves_fresh_files` (current mtime, asserts file survives) added (`775b5c06`) > Addresses @saraholt + @felixbrandt. **Manual run comment added** to `test_tmpdir_is_inside_persistent_cache_volume` — `docker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -v` (`6839cf2a`) > Addresses @saraholt. **`entrypoint.sh` comment fixed** — "cross-job ground-truth leakage" → "Remove stale partial downloads left by a previous docker-kill" (`6839cf2a`) > Addresses @felixbrandt. **`networks: []` added** to both `ocr-volume-init` definitions (`3182da8d`) > Addresses @tobiwendt: init container only needs volume access, not network access. --- ### Final test run ``` 47 passed, 1 skipped in 2.48s ``` (1 skip = `test_tmpdir_is_inside_persistent_cache_volume`, intentional — deployment-contract test that only runs inside the OCR container)
Author
Owner

🏛️ Markus Keller — Senior Application Architect

Verdict: Approved

What I checked

This PR resolves a storage-tier routing mistake, not a capacity problem — that framing is exactly right, and the ADR reflects it.


Positives

  • ADR-021 is well-written. Context, decision, alternatives (B: enlarge tmpfs — rightly rejected on cgroup mem_limit grounds; C: belt-and-suspenders — correctly rejected as unnecessary churn), and consequences including the partial-download orphan trade-off. This is what ADRs are for.
  • ocr-volume-init is minimal and correctly scoped. restart: "no", condition: service_completed_successfully, and networks: [] are all correct. The init container cannot loop, cannot start ocr-service on failure, and is not reachable from the app network.
  • DEPLOYMENT.md upgrade notes are present — failure mode, verification commands, and expected output are all documented. Operators have a runbook.
  • The fix is the minimum change. TMPDIR redirect + entrypoint fallback + volume-init. No new infrastructure components, no new dependencies.

Concern (non-blocking)

docs/architecture/c4/l2-containers.puml — the doc requirements table says a new Docker service triggers an update to the L2 container diagram. ocr-volume-init is a new Docker service in both compose files. I'd argue a transient init container is more of an operational detail than an architectural component, so I'm not flagging this as a blocker — but it's worth a decision: either add it to the diagram with a "«init»" stereotype, or document explicitly that ephemeral bootstrap services are excluded from the L2 diagram. Right now the policy is silent on this case.


Summary

The architecture is sound. The change routes model downloads to the right storage tier without touching the security boundary, the /tmp DoS cap, or the ZIP Slip protection. ADR-021 is load-bearing documentation — the CLAUDE.md reminder points future LLMs at it. Approve.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** ### What I checked This PR resolves a storage-tier routing mistake, not a capacity problem — that framing is exactly right, and the ADR reflects it. --- ### Positives - **ADR-021 is well-written.** Context, decision, alternatives (B: enlarge tmpfs — rightly rejected on cgroup mem_limit grounds; C: belt-and-suspenders — correctly rejected as unnecessary churn), and consequences including the partial-download orphan trade-off. This is what ADRs are for. - **`ocr-volume-init` is minimal and correctly scoped.** `restart: "no"`, `condition: service_completed_successfully`, and `networks: []` are all correct. The init container cannot loop, cannot start `ocr-service` on failure, and is not reachable from the app network. - **DEPLOYMENT.md upgrade notes are present** — failure mode, verification commands, and expected output are all documented. Operators have a runbook. - **The fix is the minimum change.** TMPDIR redirect + entrypoint fallback + volume-init. No new infrastructure components, no new dependencies. --- ### Concern (non-blocking) **`docs/architecture/c4/l2-containers.puml`** — the doc requirements table says a new Docker service triggers an update to the L2 container diagram. `ocr-volume-init` is a new Docker service in both compose files. I'd argue a transient init container is more of an operational detail than an architectural component, so I'm not flagging this as a blocker — but it's worth a decision: either add it to the diagram with a "«init»" stereotype, or document explicitly that ephemeral bootstrap services are excluded from the L2 diagram. Right now the policy is silent on this case. --- ### Summary The architecture is sound. The change routes model downloads to the right storage tier without touching the security boundary, the `/tmp` DoS cap, or the ZIP Slip protection. ADR-021 is load-bearing documentation — the CLAUDE.md reminder points future LLMs at it. Approve.
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

What I checked

TDD evidence, naming, function size, Python patterns, and the utils.py extraction.


Positives

  • _validate_zip_entry extraction to utils.py is clean. The function was buried in main.py alongside FastAPI app setup and business logic. It now lives in a module with a single purpose and no ML-stack imports — this is why it can be tested in CI without PyTorch. Good judgment call.
  • Test names read as sentences. test_entrypoint_creates_tmpdir, test_entrypoint_removes_day_old_orphans, test_entrypoint_preserves_fresh_files, test_zipslip_still_anchors_under_custom_tmpdir — each name tells me exactly what behavior is under test and what the expected outcome is.
  • TDD is evidenced. The PR description explicitly states: "confirmed RED before entrypoint change, GREEN after." That's the red proof we need.
  • Both sides of the mtime boundary are tested. Day-old file → deleted. Fresh file → preserved. That's the full contract.
  • test_tmpdir_is_inside_persistent_cache_volume skip annotation is correctly written — skip reason explains why the test doesn't run in CI and includes the manual docker exec command. Not a disabled test; a conditionally enabled one.

Suggestions (non-blocking)

Missing return type hint on _run_entrypoint (test_tmpdir.py):

# Current
def _run_entrypoint(tmpdir, tmp_path):

# Should be
def _run_entrypoint(tmpdir: Path, tmp_path: Path) -> subprocess.CompletedProcess[str]:

The existing code uses pathlib.Path objects (from tmp_path) but the type is unannounced. Minor — this is a test helper, not production code — but the rest of the codebase is consistently typed.

test_tempfile_uses_tmpdir_when_set doesn't test the fallback path. The entrypoint uses ${TMPDIR:-/tmp}, so when TMPDIR is unset it falls back to /tmp. There's no test asserting tempfile.TemporaryDirectory() uses /tmp when TMPDIR is absent. Not a blocker, but a mirror test would complete the contract.


Summary

Clean extraction, good TDD evidence, well-named tests. The two suggestions are polish, not correctness issues. Approve.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** ### What I checked TDD evidence, naming, function size, Python patterns, and the `utils.py` extraction. --- ### Positives - **`_validate_zip_entry` extraction to `utils.py` is clean.** The function was buried in `main.py` alongside FastAPI app setup and business logic. It now lives in a module with a single purpose and no ML-stack imports — this is why it can be tested in CI without PyTorch. Good judgment call. - **Test names read as sentences.** `test_entrypoint_creates_tmpdir`, `test_entrypoint_removes_day_old_orphans`, `test_entrypoint_preserves_fresh_files`, `test_zipslip_still_anchors_under_custom_tmpdir` — each name tells me exactly what behavior is under test and what the expected outcome is. - **TDD is evidenced.** The PR description explicitly states: "confirmed RED before entrypoint change, GREEN after." That's the red proof we need. - **Both sides of the mtime boundary are tested.** Day-old file → deleted. Fresh file → preserved. That's the full contract. - **`test_tmpdir_is_inside_persistent_cache_volume` skip annotation** is correctly written — skip reason explains *why* the test doesn't run in CI and includes the manual docker exec command. Not a disabled test; a conditionally enabled one. --- ### Suggestions (non-blocking) **Missing return type hint on `_run_entrypoint`** (`test_tmpdir.py`): ```python # Current def _run_entrypoint(tmpdir, tmp_path): # Should be def _run_entrypoint(tmpdir: Path, tmp_path: Path) -> subprocess.CompletedProcess[str]: ``` The existing code uses `pathlib.Path` objects (from `tmp_path`) but the type is unannounced. Minor — this is a test helper, not production code — but the rest of the codebase is consistently typed. **`test_tempfile_uses_tmpdir_when_set` doesn't test the fallback path.** The entrypoint uses `${TMPDIR:-/tmp}`, so when `TMPDIR` is unset it falls back to `/tmp`. There's no test asserting `tempfile.TemporaryDirectory()` uses `/tmp` when `TMPDIR` is absent. Not a blocker, but a mirror test would complete the contract. --- ### Summary Clean extraction, good TDD evidence, well-named tests. The two suggestions are polish, not correctness issues. Approve.
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

What I checked

Compose changes, image pinning, service ordering, CI pipeline, and startup behavior.


Positives

  • alpine:3.21 pinned. Not :latest. Renovate will bump it.
  • restart: "no" on ocr-volume-init. If chown fails, the container exits non-zero and stays dead — it doesn't loop. ocr-service is blocked by condition: service_completed_successfully, so the dependency chain fails loudly. This is the correct failure posture.
  • networks: [] drops the init container from archiv-net. It only needs volume access, not network access. Minimal exposure surface.
  • CI dependency update is correct. Adding fastapi==0.115.6 to the test pip install is required because utils.py imports from fastapi import HTTPException and test_tmpdir.py imports from utils import _validate_zip_entry. Without this, the OCR CI job would fail on import. The pinned version is good practice.
  • DEPLOYMENT.md upgrade notes are complete — log command, expected output, and failure mode description are all present. An operator following this guide cold can verify the init container succeeded without guessing.

Concern (non-blocking)

chown -R 1000:1000 /app/cache /app/models runs on every docker compose up. On a fully-populated cache volume with the Surya text_recognition model (~1.34 GB in many small files), this recursive chown adds 1–4 seconds to every startup. Not a problem today; worth monitoring if startup time becomes a friction point. One mitigation when the time comes: run chown only if the volume is newly created (e.g. check a sentinel file). Pre-existing issue context: this replaces a manual one-time step, so running it idempotently on every up is acceptable for now.

Volume naming inconsistency between files (pre-existing, not introduced here). Dev uses ocr_models / ocr_cache (underscores); prod uses ocr-models / ocr-cache (hyphens). This PR correctly follows the existing convention in each file. Not a problem this PR introduced, just noting it for awareness.


Summary

Compose changes are minimal and correct. Init container is properly scoped. CI pipeline is updated with the right dependency. The recurring chown -R is an acceptable trade-off at current scale. Approve.

## 🔧 Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** ### What I checked Compose changes, image pinning, service ordering, CI pipeline, and startup behavior. --- ### Positives - **`alpine:3.21` pinned.** Not `:latest`. Renovate will bump it. ✅ - **`restart: "no"` on `ocr-volume-init`.** If chown fails, the container exits non-zero and stays dead — it doesn't loop. `ocr-service` is blocked by `condition: service_completed_successfully`, so the dependency chain fails loudly. This is the correct failure posture. - **`networks: []`** drops the init container from `archiv-net`. It only needs volume access, not network access. Minimal exposure surface. ✅ - **CI dependency update is correct.** Adding `fastapi==0.115.6` to the test pip install is required because `utils.py` imports `from fastapi import HTTPException` and `test_tmpdir.py` imports `from utils import _validate_zip_entry`. Without this, the OCR CI job would fail on import. The pinned version is good practice. - **`DEPLOYMENT.md` upgrade notes are complete** — log command, expected output, and failure mode description are all present. An operator following this guide cold can verify the init container succeeded without guessing. --- ### Concern (non-blocking) **`chown -R 1000:1000 /app/cache /app/models` runs on every `docker compose up`.** On a fully-populated cache volume with the Surya `text_recognition` model (~1.34 GB in many small files), this recursive chown adds 1–4 seconds to every startup. Not a problem today; worth monitoring if startup time becomes a friction point. One mitigation when the time comes: run chown only if the volume is newly created (e.g. check a sentinel file). Pre-existing issue context: this replaces a manual one-time step, so running it idempotently on every up is acceptable for now. **Volume naming inconsistency between files (pre-existing, not introduced here).** Dev uses `ocr_models` / `ocr_cache` (underscores); prod uses `ocr-models` / `ocr-cache` (hyphens). This PR correctly follows the existing convention in each file. Not a problem this PR introduced, just noting it for awareness. --- ### Summary Compose changes are minimal and correct. Init container is properly scoped. CI pipeline is updated with the right dependency. The recurring `chown -R` is an acceptable trade-off at current scale. Approve.
Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

What I checked

TMPDIR redirect attack surface, ocr-volume-init privilege model, _validate_zip_entry extraction, ZIP Slip regression, and entrypoint orphan cleanup.


Positives

ZIP Slip protection is maintained and explicitly regression-tested.

_validate_zip_entry was extracted from main.py to utils.py without modification — the logic is identical:

def _validate_zip_entry(name: str, extract_dir: str) -> None:
    if os.path.isabs(name) or name.startswith(".."):
        raise HTTPException(status_code=400, detail=f"Unsafe ZIP entry: {name}")
    resolved = os.path.realpath(os.path.join(extract_dir, name))
    if not resolved.startswith(os.path.realpath(extract_dir)):
        raise HTTPException(status_code=400, detail=f"ZIP Slip detected: {name}")

The critical question with a TMPDIR redirect is: does os.path.realpath() still anchor correctly when extract_dir is /app/cache/.tmp/tmpXXX instead of /tmp/tmpXXX? Yes — realpath() resolves the full canonical path regardless of where the extraction directory lives. test_zipslip_still_anchors_under_custom_tmpdir explicitly tests this with the new TMPDIR path.

ADR-021 explicitly documents the security non-impact:

"ZIP Slip protection in _validate_zip_entry() is unaffected — it uses os.path.realpath() anchored to the extraction directory regardless of where that directory lives."

This is the right thing to document — a future developer seeing TMPDIR pointing somewhere unusual shouldn't have to reason through the ZIP Slip logic from scratch.

ocr-volume-init privilege model is appropriate. The init container runs as Alpine's default root user, which is required to chown volumes created by Docker (owned by root). The container has no network access (networks: []), no capabilities beyond what Alpine provides by default, and exits immediately after chown + mkdir. The attack surface of a container that exits in <1 second with no network and no entrypoint persistence is minimal.

Entrypoint orphan cleanup is safe.

find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true

-mindepth 1 prevents deleting TMPDIR itself. The cleanup is scoped to the TMPDIR subtree — it cannot traverse upward. 2>/dev/null || true ensures a find error (e.g. permission denied on a subdirectory) doesn't abort the entrypoint. Safe.


No findings.

This PR does not touch authentication, authorization, SSRF protection, or any security boundary. The change is purely about which storage tier receives model download staging. Security posture is unchanged or improved (the ZIP Slip test now has broader coverage). Approve.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** ### What I checked TMPDIR redirect attack surface, `ocr-volume-init` privilege model, `_validate_zip_entry` extraction, ZIP Slip regression, and entrypoint orphan cleanup. --- ### Positives **ZIP Slip protection is maintained and explicitly regression-tested.** `_validate_zip_entry` was extracted from `main.py` to `utils.py` without modification — the logic is identical: ```python def _validate_zip_entry(name: str, extract_dir: str) -> None: if os.path.isabs(name) or name.startswith(".."): raise HTTPException(status_code=400, detail=f"Unsafe ZIP entry: {name}") resolved = os.path.realpath(os.path.join(extract_dir, name)) if not resolved.startswith(os.path.realpath(extract_dir)): raise HTTPException(status_code=400, detail=f"ZIP Slip detected: {name}") ``` The critical question with a TMPDIR redirect is: does `os.path.realpath()` still anchor correctly when `extract_dir` is `/app/cache/.tmp/tmpXXX` instead of `/tmp/tmpXXX`? Yes — `realpath()` resolves the full canonical path regardless of where the extraction directory lives. `test_zipslip_still_anchors_under_custom_tmpdir` explicitly tests this with the new TMPDIR path. ✅ **ADR-021 explicitly documents the security non-impact:** > "ZIP Slip protection in `_validate_zip_entry()` is unaffected — it uses `os.path.realpath()` anchored to the extraction directory regardless of where that directory lives." This is the right thing to document — a future developer seeing TMPDIR pointing somewhere unusual shouldn't have to reason through the ZIP Slip logic from scratch. **`ocr-volume-init` privilege model is appropriate.** The init container runs as Alpine's default root user, which is required to `chown` volumes created by Docker (owned by root). The container has no network access (`networks: []`), no capabilities beyond what Alpine provides by default, and exits immediately after chown + mkdir. The attack surface of a container that exits in <1 second with no network and no entrypoint persistence is minimal. **Entrypoint orphan cleanup is safe.** ```bash find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true ``` `-mindepth 1` prevents deleting TMPDIR itself. The cleanup is scoped to the TMPDIR subtree — it cannot traverse upward. `2>/dev/null || true` ensures a find error (e.g. permission denied on a subdirectory) doesn't abort the entrypoint. Safe. --- ### No findings. This PR does not touch authentication, authorization, SSRF protection, or any security boundary. The change is purely about which storage tier receives model download staging. Security posture is unchanged or improved (the ZIP Slip test now has broader coverage). Approve.
Author
Owner

📋 Elicit — Requirements Engineer

Verdict: Approved

What I checked

Traceability from issue #614 to acceptance criteria, completeness of the PR test plan, and whether the stated ACs map to the implemented behavior.


Requirements Traceability

Acceptance Criterion Implementation Status
AC #1: CI OCR tests pass test_tmpdir.py added to CI; fastapi dep added
AC #2: TDD red/green confirmed PR description: "confirmed RED before entrypoint change, GREEN after"
AC #3: docker compose config contains TMPDIR: /app/cache/.tmp Both compose files updated with env var
AC #4: Staging dry-run (manual) Out of scope for this review; requires staging environment 🔲 Manual
AC #5: stat -c %u /app/cache/.tmp returns 1000 ocr-volume-init chowns /app/cache/.tmp to 1000:1000
AC #6 (referenced in description): /app/cache/.tmp created on fresh volumes ocr-volume-init runs mkdir -p /app/cache/.tmp; entrypoint.sh as fallback

The root cause (routing problem, not capacity) is correctly identified and the issue title accurately reflects the fix.


Observation (non-blocking)

The failure mode when mkdir -p /app/cache/.tmp fails inside ocr-volume-init is undocumented in DEPLOYMENT.md. The chown failure mode is documented ("if chown is denied, container exits non-zero"). The mkdir step is in the same sh -c command joined by &&, so a mkdir failure would also cause a non-zero exit — but this isn't stated explicitly. A reader following the upgrade notes might not realize the mkdir step can fail independently of chown (e.g., if the volume is nearly full at the filesystem level). Minor documentation gap; not a correctness issue.


Summary

All automated acceptance criteria are implemented and testable. The PR description is precise and includes the manual staging AC that can't be automated. Requirements are fully traceable from issue → implementation → test. Approve.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** ### What I checked Traceability from issue #614 to acceptance criteria, completeness of the PR test plan, and whether the stated ACs map to the implemented behavior. --- ### Requirements Traceability | Acceptance Criterion | Implementation | Status | |---|---|---| | AC #1: CI OCR tests pass | `test_tmpdir.py` added to CI; `fastapi` dep added | ✅ | | AC #2: TDD red/green confirmed | PR description: "confirmed RED before entrypoint change, GREEN after" | ✅ | | AC #3: `docker compose config` contains `TMPDIR: /app/cache/.tmp` | Both compose files updated with env var | ✅ | | AC #4: Staging dry-run (manual) | Out of scope for this review; requires staging environment | 🔲 Manual | | AC #5: `stat -c %u /app/cache/.tmp` returns `1000` | `ocr-volume-init` chowns `/app/cache/.tmp` to 1000:1000 | ✅ | | AC #6 (referenced in description): `/app/cache/.tmp` created on fresh volumes | `ocr-volume-init` runs `mkdir -p /app/cache/.tmp`; `entrypoint.sh` as fallback | ✅ | The root cause (routing problem, not capacity) is correctly identified and the issue title accurately reflects the fix. --- ### Observation (non-blocking) **The failure mode when `mkdir -p /app/cache/.tmp` fails inside `ocr-volume-init` is undocumented in DEPLOYMENT.md.** The chown failure mode is documented ("if chown is denied, container exits non-zero"). The mkdir step is in the same `sh -c` command joined by `&&`, so a mkdir failure would also cause a non-zero exit — but this isn't stated explicitly. A reader following the upgrade notes might not realize the mkdir step can fail independently of chown (e.g., if the volume is nearly full at the filesystem level). Minor documentation gap; not a correctness issue. --- ### Summary All automated acceptance criteria are implemented and testable. The PR description is precise and includes the manual staging AC that can't be automated. Requirements are fully traceable from issue → implementation → test. Approve.
Author
Owner

🧪 Sara Holt — Senior QA Engineer

Verdict: Approved

What I checked

Test pyramid placement, test naming, TDD evidence, CI pipeline update, coverage of the mtime boundary, skip annotation quality, and missing edge cases.


Positives

TDD evidence is explicit and credible. The PR description states "confirmed RED before entrypoint change, GREEN after" — that's the red proof I want to see. The failing test existed before the entrypoint mkdir -p line was added.

Five distinct behaviors, five distinct tests. No bundled assertions, no multi-behavior test functions:

Test Behavior verified
test_tempfile_uses_tmpdir_when_set Python stdlib respects TMPDIR env var
test_entrypoint_creates_tmpdir Entrypoint creates TMPDIR when absent
test_tmpdir_is_inside_persistent_cache_volume Contract: TMPDIR must be under /app/cache
test_entrypoint_removes_day_old_orphans Files older than 1 day are deleted
test_entrypoint_preserves_fresh_files Files newer than 1 day are kept
test_zipslip_still_anchors_under_custom_tmpdir ZIP Slip protection works under non-default TMPDIR

Mtime boundary is tested on both sides. test_entrypoint_removes_day_old_orphans uses os.utime() to backdate by 2 days (clearly over threshold). test_entrypoint_preserves_fresh_files uses the current mtime (clearly under threshold). Together they fully specify the find -mtime +1 behavior.

Skip annotation is correct. @pytest.mark.skipif(not os.environ.get("TMPDIR", "").startswith("/app/cache"), reason=...) is a conditional skip, not a disabled test. The skip reason is precise, and the manual run command is documented in the docstring. This is the right pattern.

CI pipeline updated. test_tmpdir.py is added to the pytest invocation. fastapi==0.115.6 is pinned in the pip install step — required because utils.py imports HTTPException.


Suggestions (non-blocking)

Missing test: unset TMPDIR fallback path. The entrypoint uses ${TMPDIR:-/tmp}, which means when TMPDIR is unset the mkdir and find operate on /tmp. There's no test covering this path. A test stub:

def test_entrypoint_creates_tmpdir_when_tmpdir_unset(tmp_path):
    """Entrypoint uses /tmp as fallback when TMPDIR is not set."""
    env = {k: v for k, v in os.environ.items() if k != "TMPDIR"}
    # stub python3/uvicorn as in _run_entrypoint ...
    result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True)
    assert result.returncode == 0

This would catch a regression if someone changes ${TMPDIR:-/tmp} to ${TMPDIR} (which would cause mkdir -p "" or a bash error).

_run_entrypoint helper duplicates setup from test_entrypoint_creates_tmpdir. The stub-bin creation logic appears twice (once inline in the first long test, once via _run_entrypoint). The first test doesn't use _run_entrypoint — it duplicates the setup manually. Extracting to _run_entrypoint was the right call; applying it to the first test too would reduce duplication. Minor refactor, not a correctness issue.


Summary

Test coverage is solid for the implemented behaviors. TDD discipline was followed. CI is updated. The two suggestions are completeness items, not correctness failures. Approve.

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ✅ Approved** ### What I checked Test pyramid placement, test naming, TDD evidence, CI pipeline update, coverage of the mtime boundary, skip annotation quality, and missing edge cases. --- ### Positives **TDD evidence is explicit and credible.** The PR description states "confirmed RED before entrypoint change, GREEN after" — that's the red proof I want to see. The failing test existed before the entrypoint `mkdir -p` line was added. ✅ **Five distinct behaviors, five distinct tests.** No bundled assertions, no multi-behavior test functions: | Test | Behavior verified | |---|---| | `test_tempfile_uses_tmpdir_when_set` | Python stdlib respects TMPDIR env var | | `test_entrypoint_creates_tmpdir` | Entrypoint creates TMPDIR when absent | | `test_tmpdir_is_inside_persistent_cache_volume` | Contract: TMPDIR must be under /app/cache | | `test_entrypoint_removes_day_old_orphans` | Files older than 1 day are deleted | | `test_entrypoint_preserves_fresh_files` | Files newer than 1 day are kept | | `test_zipslip_still_anchors_under_custom_tmpdir` | ZIP Slip protection works under non-default TMPDIR | **Mtime boundary is tested on both sides.** `test_entrypoint_removes_day_old_orphans` uses `os.utime()` to backdate by 2 days (clearly over threshold). `test_entrypoint_preserves_fresh_files` uses the current mtime (clearly under threshold). Together they fully specify the `find -mtime +1` behavior. ✅ **Skip annotation is correct.** `@pytest.mark.skipif(not os.environ.get("TMPDIR", "").startswith("/app/cache"), reason=...)` is a conditional skip, not a disabled test. The skip reason is precise, and the manual run command is documented in the docstring. This is the right pattern. ✅ **CI pipeline updated.** `test_tmpdir.py` is added to the pytest invocation. `fastapi==0.115.6` is pinned in the pip install step — required because `utils.py` imports `HTTPException`. ✅ --- ### Suggestions (non-blocking) **Missing test: unset TMPDIR fallback path.** The entrypoint uses `${TMPDIR:-/tmp}`, which means when TMPDIR is unset the mkdir and find operate on `/tmp`. There's no test covering this path. A test stub: ```python def test_entrypoint_creates_tmpdir_when_tmpdir_unset(tmp_path): """Entrypoint uses /tmp as fallback when TMPDIR is not set.""" env = {k: v for k, v in os.environ.items() if k != "TMPDIR"} # stub python3/uvicorn as in _run_entrypoint ... result = subprocess.run(["bash", _ENTRYPOINT], env=env, capture_output=True, text=True) assert result.returncode == 0 ``` This would catch a regression if someone changes `${TMPDIR:-/tmp}` to `${TMPDIR}` (which would cause `mkdir -p ""` or a bash error). **`_run_entrypoint` helper duplicates setup from `test_entrypoint_creates_tmpdir`.** The stub-bin creation logic appears twice (once inline in the first long test, once via `_run_entrypoint`). The first test doesn't use `_run_entrypoint` — it duplicates the setup manually. Extracting to `_run_entrypoint` was the right call; applying it to the first test too would reduce duplication. Minor refactor, not a correctness issue. --- ### Summary Test coverage is solid for the implemented behaviors. TDD discipline was followed. CI is updated. The two suggestions are completeness items, not correctness failures. Approve.
Author
Owner

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: Approved

What I checked

All 12 changed files for any user-facing UI, frontend component, or accessibility impact.


This PR is entirely infrastructure and DevOps:

  • Docker Compose changes (both environments)
  • Dockerfile ENV addition
  • Bash entrypoint changes
  • Python test file + utility module
  • CI workflow update
  • Documentation updates (ADR, DEPLOYMENT.md, README, CLAUDE.md)

There are no frontend files, no Svelte components, no routes, no Tailwind classes, and no i18n strings in this diff. No user-facing behavior is affected.

The docs/DEPLOYMENT.md prose additions are clear and followable by an operator — the command examples use proper code blocks and the expected outputs are explicit.

LGTM from a UX perspective.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** ### What I checked All 12 changed files for any user-facing UI, frontend component, or accessibility impact. --- This PR is entirely infrastructure and DevOps: - Docker Compose changes (both environments) - Dockerfile ENV addition - Bash entrypoint changes - Python test file + utility module - CI workflow update - Documentation updates (ADR, DEPLOYMENT.md, README, CLAUDE.md) There are **no frontend files, no Svelte components, no routes, no Tailwind classes, and no i18n strings** in this diff. No user-facing behavior is affected. The `docs/DEPLOYMENT.md` prose additions are clear and followable by an operator — the command examples use proper code blocks and the expected outputs are explicit. LGTM from a UX perspective.
marcel merged commit 193a4d6ee6 into main 2026-05-18 11:32:37 +02:00
marcel deleted branch feat/issue-614-tmpdir-persistent-volume 2026-05-18 11:32:37 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#615