From cfd49ff69ece15c73c582bec45f77d3316f080e5 Mon Sep 17 00:00:00 2001
From: Marcel
Date: Mon, 18 May 2026 10:58:10 +0200
Subject: [PATCH] docs(ocr): document TMPDIR convention and add ADR-021
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6
---
 .gitea/workflows/ci.yml                     |  7 +-
 .../021-tmpdir-persistent-volume-staging.md | 68 +++++++++++++++++++
 ocr-service/CLAUDE.md                       |  2 +
 ocr-service/README.md                       |  4 ++
 4 files changed, 79 insertions(+), 2 deletions(-)
 create mode 100644 docs/adr/021-tmpdir-persistent-volume-staging.md

diff --git a/.gitea/workflows/ci.yml b/.gitea/workflows/ci.yml
index a086f7c8..20882024 100644
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -148,7 +148,10 @@ jobs:
           path: frontend/test-results/screenshots/
 
   # ─── OCR Service Unit Tests ───────────────────────────────────────────────────
-  # Only spell_check.py, test_confidence.py, test_sender_registry.py — no ML stack required.
+  # Only stdlib/lightweight tests — no ML stack (PyTorch/Surya/Kraken) required.
+  # test_tmpdir.py covers the TMPDIR env var and entrypoint mkdir behaviour (ADR-021).
+  # test_tmpdir_is_inside_persistent_cache_volume is skipped in CI (TMPDIR not
+  # set to /app/cache here); it runs inside the deployed Docker container.
   ocr-tests:
     name: OCR Service Tests
     runs-on: ubuntu-latest
@@ -164,7 +167,7 @@ jobs:
         working-directory: ocr-service
 
       - name: Run OCR unit tests (no ML stack required)
-        run: python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
+        run: python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py test_tmpdir.py -v
         working-directory: ocr-service
 
   # ─── Backend Unit & Slice Tests ───────────────────────────────────────────────
diff --git a/docs/adr/021-tmpdir-persistent-volume-staging.md b/docs/adr/021-tmpdir-persistent-volume-staging.md
new file mode 100644
index 00000000..aaa7a203
--- /dev/null
+++ b/docs/adr/021-tmpdir-persistent-volume-staging.md
@@ -0,0 +1,68 @@
+# ADR-021 — Route Surya model-download staging to the persistent cache volume via TMPDIR
+
+**Status:** Accepted
+**Date:** 2026-05-18
+**Issue:** #614
+
+---
+
+## Context
+
+After the container hardening baseline (ADR-019), the OCR service runs with `read_only: true` and a 512 MB `/tmp` tmpfs. The tmpfs was sized for training-ZIP extraction (typically 20–50 images, well under 100 MB).
+
+Surya's `download_directory()` (surya ≥ 0.6, `surya/common/s3.py`) stages every model file through `tempfile.TemporaryDirectory()` before moving it to the final cache location. `TemporaryDirectory()` honours `$TMPDIR` and falls back to `/tmp`. The `text_recognition` model is 1.34 GB; future Surya models will be in the same range. This blows the 512 MB budget at ~510 MB with `OSError: [Errno 28] No space left on device`.
+
+The host has 1.8 TB free on the disk that backs `/app/cache`. The failure is a routing problem, not a capacity problem.
+
+---
+
+## Decision
+
+Set `TMPDIR=/app/cache/.tmp` in the OCR container so all `tempfile` staging goes to the persistent SSD-backed cache volume.
+
+```yaml
+# docker-compose.yml / docker-compose.prod.yml — ocr-service.environment
+TMPDIR: /app/cache/.tmp
+```
+
+```dockerfile
+# ocr-service/Dockerfile — default for bare docker-run usage
+ENV TMPDIR=/app/cache/.tmp
+```
+
+```bash
+# ocr-service/entrypoint.sh — idempotent directory bootstrap
+mkdir -p "${TMPDIR:-/tmp}"
+find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true
+```
+
+A one-shot `ocr-volume-init` service in both compose files runs before `ocr-service` to `chown -R 1000:1000` the volumes and `mkdir -p /app/cache/.tmp`. This replaces the manual `docker run --rm alpine chown` step performed on 2026-05-18 and makes fresh-volume correctness a permanent infrastructure-as-code guarantee.
+
+The `/tmp` tmpfs remains at 512 MB and continues to serve training-ZIP extraction and transient PDF buffers — its original purpose.
+
+---
+
+## Consequences
+
+**Positive**
+
+- Surya model downloads complete: 1.34 GB fits on the SSD, not in 512 MB of RAM.
+- `shutil.move()` from staging → cache becomes a same-filesystem `rename(2)` — atomic and near-free.
+- Volume ownership is now automated; no manual `docker run --rm alpine chown` on redeploy.
+- `/tmp` retains its small 512 MB DoS cap for attacker-influenceable training endpoints (post-auth only, behind `X-Training-Token`).
+- ZIP Slip protection in `_validate_zip_entry()` is unaffected — it uses `os.path.realpath()` anchored to the extraction directory regardless of where that directory lives.
+
+**Negative / Trade-offs**
+
+- If the container is `docker kill`ed mid-download, partial files persist in `/app/cache/.tmp` across container restarts. Mitigated by the `find -mtime +1 -delete` in `entrypoint.sh` — orphans older than one day are removed on startup.
+- `TMPDIR` pointing inside a volume mount is non-obvious. Any future move of `/app/cache` to a different storage tier must revisit this setting. This ADR is the load-bearing reference.
+
+---
+
+## Alternatives considered
+
+**Approach B — Enlarge `/tmp` to 4 GB**
+One-line change. Discarded because: (1) 4 GB tmpfs counts against the cgroup `mem_limit`; on CX32 hosts with `OCR_MEM_LIMIT=6g` the combined Surya resident set + tmpfs would trigger OOMKill on cold start; (2) staging GB-scale model files through RAM is using the wrong storage tier; (3) any future model larger than 4 GB requires another bump.
+
+**Approach C — Both TMPDIR redirect and enlarged /tmp**
+Belt-and-suspenders: Approach A + 1 GB tmpfs. Discarded in favour of the cleaner Approach A. The defence-in-depth benefit does not outweigh the extra compose churn; the 512 MB cap on `/tmp` is intentional.
diff --git a/ocr-service/CLAUDE.md b/ocr-service/CLAUDE.md
index f628c60b..09d4a895 100644
--- a/ocr-service/CLAUDE.md
+++ b/ocr-service/CLAUDE.md
@@ -5,3 +5,5 @@
 **LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.
 
 **LLM reminder:** `ALLOWED_PDF_HOSTS` must never be set to `*` — that opens SSRF. The default (`minio,localhost,127.0.0.1`) is correct for dev.
+
+**LLM reminder:** `TMPDIR` points to `/app/cache/.tmp` (persistent SSD volume). Never redirect it back to `/tmp` or any RAM-backed path — `/tmp` is 512 MB and cannot stage GB-scale Surya model downloads (causes ENOSPC). The `ocr-volume-init` container creates the directory on fresh volumes; `entrypoint.sh` ensures it exists as a fallback. See ADR-021.
diff --git a/ocr-service/README.md b/ocr-service/README.md
index 976db06b..f8600cb9 100644
--- a/ocr-service/README.md
+++ b/ocr-service/README.md
@@ -32,6 +32,10 @@ Python FastAPI microservice that performs the actual handwritten text recognitio
 | `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | — | SSRF protection — comma-separated allowed PDF source hosts. Never set to `*`. |
 | `KRAKEN_MODEL_PATH` | `/app/models/` | — | — | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
 | `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | — | — | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
+| `HF_HOME` | `/app/cache` | — | — | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
+| `XDG_CACHE_HOME` | `/app/cache` | — | — | XDG cache root (used by some Surya components alongside `HF_HOME`). |
+| `TORCH_HOME` | `/app/models/torch` | — | — | PyTorch model cache. Kept on the persistent models volume. |
+| `TMPDIR` | `/app/cache/.tmp` | — | — | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs — see ADR-021. |
 
 ## Key files
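
Reviewer note, not part of the patch: the mechanism ADR-021 relies on can be checked with standard-library Python alone. The sketch below is illustrative only. `staging_parent()` and `same_filesystem()` are hypothetical helper names that do not come from the repository or from Surya. It assumes nothing beyond `tempfile` honouring `$TMPDIR`, and it compares `st_dev` to show when a later move can stay on one filesystem.

```python
import os
import tempfile


def staging_parent(tmpdir: str) -> str:
    """Return the directory tempfile stages into when TMPDIR is set to `tmpdir`.

    Hypothetical helper for illustration; it temporarily overrides TMPDIR
    and restores the previous value afterwards.
    """
    os.makedirs(tmpdir, exist_ok=True)  # same idea as `mkdir -p "$TMPDIR"`
    previous = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = tmpdir
    tempfile.tempdir = None  # drop the cached default so TMPDIR is re-read
    try:
        with tempfile.TemporaryDirectory() as staged:
            # The staged directory is created directly under TMPDIR.
            return os.path.dirname(staged)
    finally:
        if previous is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = previous
        tempfile.tempdir = None  # do not leak the override to later callers


def same_filesystem(a: str, b: str) -> bool:
    """True if both paths live on the same device, so a move can be a rename."""
    return os.stat(a).st_dev == os.stat(b).st_dev
```

Inside the deployed container one would expect `staging_parent("/app/cache/.tmp")` to return that same path, and `same_filesystem("/app/cache/.tmp", "/app/cache")` to hold. The second property is what lets the final move be a rename rather than a copy.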