fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615

Merged
marcel merged 10 commits from feat/issue-614-tmpdir-persistent-volume into main 2026-05-18 11:32:37 +02:00

10 Commits

Author SHA1 Message Date
Marcel
193a4d6ee6 docs(deployment): document ocr-volume-init bootstrap service in §8 upgrade notes
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m1s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m0s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
CI / Unit & Component Tests (push) Successful in 3m5s
CI / OCR Service Tests (push) Successful in 19s
CI / Backend Unit Tests (push) Successful in 3m1s
CI / fail2ban Regex (push) Successful in 43s
CI / Semgrep Security Scan (push) Successful in 18s
CI / Compose Bucket Idempotency (push) Successful in 59s
Explains what ocr-volume-init does (chown volumes + create TMPDIR), how to
verify it succeeded (docker logs), and what failure looks like. Addresses
reviewer concerns from @mkeller and @tobiwendt on PR #615.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:23:04 +02:00
Marcel
3182da8d92 fix(infra): pin ocr-volume-init to alpine:3.21 and drop project network
alpine:3 is a moving tag — pinning to 3.21 makes builds reproducible and
rollbacks possible. networks: [] removes the init container from the project
network since it only needs volume access, not network access (least privilege).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:21:55 +02:00
Marcel
6839cf2a33 docs(ocr): clarify entrypoint comment and add manual run hint for skipped test
- entrypoint.sh: replace "cross-job ground-truth leakage" with plain
  "Remove stale partial downloads left by a previous docker-kill"
- test_tmpdir_is_inside_persistent_cache_volume: add docker exec command
  so future developers know how to run this deployment-contract test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:20:45 +02:00
Marcel
775b5c062e test(ocr): add orphan cleanup behavior tests for entrypoint.sh find -mtime
test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files
verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes
without mocking system time. Also extracts a _run_entrypoint helper to remove
duplicated subprocess setup.
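
The os.utime() technique described above can be sketched as follows. This is a minimal illustration, not the project's test code: remove_stale_files is a hypothetical pure-Python stand-in for the shell cleanup, the file names are made up, and restricting the cleanup to regular files is an assumption. One detail worth knowing is that find truncates file age to whole 24-hour periods, so -mtime +1 only matches files last modified at least two full days ago.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def remove_stale_files(root, now=None):
    """Hypothetical stand-in for `find root -type f -mtime +1 -delete`.

    find truncates the age to whole days, so "+1" means the file was
    last modified at least two full days ago.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.lstat(path).st_mtime) // SECONDS_PER_DAY)
            if age_days > 1:
                os.remove(path)
                removed.append(path)
    return removed


def demo():
    """Fabricate one old and one fresh file, then run the cleanup."""
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "model.partial")       # illustrative name
        fresh = os.path.join(root, "model.downloading")  # illustrative name
        for path in (old, fresh):
            with open(path, "wb") as fh:
                fh.write(b"x")
        # Backdate the mtime instead of mocking the system clock.
        three_days_ago = time.time() - 3 * SECONDS_PER_DAY
        os.utime(old, (three_days_ago, three_days_ago))
        remove_stale_files(root)
        return os.path.exists(old), os.path.exists(fresh)
```

Backdating with os.utime() keeps the test independent of wall-clock patching, so the same fixture would also work against the real script run in a subprocess.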

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:19:33 +02:00
Marcel
e31dac5c9c test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir
A silent non-zero exit would previously cause the test to pass incorrectly
because only directory creation was checked. Exit code is now the first
assertion, catching regressions before the filesystem check runs.
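
A rough sketch of the assertion order described above. STAND_IN is a hypothetical replacement for entrypoint.sh (the real script is not reproduced here), and run_stand_in and check_entrypoint are illustrative names. The stand-in creates the directory and then exits with a chosen status, which is exactly the case a directory-only check would miss.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for entrypoint.sh: creates $TMPDIR, then exits with
# the status passed as its first argument.
STAND_IN = (
    "import os, sys; "
    "os.makedirs(os.environ['TMPDIR'], exist_ok=True); "
    "sys.exit(int(sys.argv[1]))"
)


def run_stand_in(tmpdir, exit_status):
    """Run the stand-in with TMPDIR pointing at tmpdir."""
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(
        [sys.executable, "-c", STAND_IN, str(exit_status)],
        env=env,
        capture_output=True,
        text=True,
    )


def check_entrypoint(tmpdir, exit_status=0):
    result = run_stand_in(tmpdir, exit_status)
    # Exit code first: a failing script must not pass just because the
    # directory happens to exist.
    assert result.returncode == 0, result.stderr
    assert os.path.isdir(tmpdir)
```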

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:18:14 +02:00
Marcel
c2bd1b34f0 refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI
_validate_zip_entry has no ML-stack dependency; importing it via main.py
pulled in surya/torch and caused the test to be skipped in CI. Moving it
to utils.py (fastapi only) and adding fastapi to the CI lightweight install
lets test_zipslip_still_anchors_under_custom_tmpdir run on every push.
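
For context, a ZIP Slip check of the kind this commit moves can be written with the standard library alone. The sketch below is an assumption about the general technique, not the project's _validate_zip_entry: the function name, signature, and the use of ValueError are hypothetical (the real helper may raise a FastAPI exception instead).

```python
import os


def validate_zip_entry(entry_name, dest_dir):
    """Return the resolved extraction path, or raise if it escapes dest_dir.

    Hypothetical sketch: resolves the candidate path and requires that it
    stays under the resolved destination root.
    """
    dest_root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest_root, entry_name))
    # commonpath equals dest_root only when target is dest_root or below it.
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError(f"unsafe archive entry: {entry_name!r}")
    return target
```

Because the check depends only on os.path, a test for it can run in a lightweight CI job without importing any ML dependencies.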

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:17:15 +02:00
Marcel
cfd49ff69e docs(ocr): document TMPDIR convention and add ADR-021
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m7s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:58:10 +02:00
Marcel
1f7b08b74f fix(ocr): add TMPDIR env var and ocr-volume-init service to compose files
TMPDIR=/app/cache/.tmp routes Surya model staging to the SSD-backed cache
volume instead of the 512 MB /tmp tmpfs. The ocr-volume-init one-shot service
runs first to ensure correct ownership (uid 1000) and creates /app/cache/.tmp
on fresh volumes, making AC #6 ("fresh volume still works") a permanent
infrastructure-as-code guarantee rather than a manual chown step.

Both docker-compose.yml and docker-compose.prod.yml are updated in the same
commit to prevent the silent drift that occurred with the 512 MB tmpfs comment.
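
To illustrate why the TMPDIR variable redirects staging, here is a small sketch of how Python's tempfile module resolves its directory. It assumes the download code stages files through tempfile, which the commit does not state; staging_dir_for is a hypothetical helper name. Note that tempfile skips a TMPDIR candidate that does not exist or is not writable and falls back to other locations, which is consistent with creating the directory before the service starts.

```python
import os
import tempfile


def staging_dir_for(tmpdir):
    """Return the directory tempfile would use when TMPDIR points at tmpdir."""
    # The directory must exist, otherwise tempfile ignores this candidate.
    os.makedirs(tmpdir, exist_ok=True)
    previous = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = tmpdir
    tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the caller's environment and clear the cache again.
        if previous is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = previous
        tempfile.tempdir = None
```

In the container, the equivalent check would be calling tempfile.gettempdir() inside the running service and comparing the result with /app/cache/.tmp.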

Fixes #614. See ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:56:10 +02:00
Marcel
240b373f68 fix(ocr): create TMPDIR on startup and clear day-old orphans
On a fresh ocr_cache volume, /app/cache/.tmp does not exist yet. The mkdir
ensures the first Surya model download can proceed without ENOSPC on the
512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill
mid-download, preventing cross-job ground-truth leakage.

Fixes #614. See ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:54:17 +02:00
Marcel
09a043431e build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging
Without this, running the image outside compose loses the TMPDIR redirect
and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC).
See issue #614, ADR-021.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:53:15 +02:00