docs(ocr): document TMPDIR convention and add ADR-021
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m7s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m7s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m7s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 59s
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows to the environment variables table - ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume - docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision, trade-offs, and rejected alternatives (Approach B / C) for issue #614 - ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
68
docs/adr/021-tmpdir-persistent-volume-staging.md
Normal file
68
docs/adr/021-tmpdir-persistent-volume-staging.md
Normal file
@@ -0,0 +1,68 @@
|
||||
# ADR-021 — Route Surya model-download staging to the persistent cache volume via TMPDIR
|
||||
|
||||
**Status:** Accepted
|
||||
**Date:** 2026-05-18
|
||||
**Issue:** #614
|
||||
|
||||
---
|
||||
|
||||
## Context
|
||||
|
||||
After the container hardening baseline (ADR-019), the OCR service runs with `read_only: true` and a 512 MB `/tmp` tmpfs. The tmpfs was sized for training-ZIP extraction (typically 20–50 images, well under 100 MB).
|
||||
|
||||
Surya's `download_directory()` (surya ≥ 0.6, `surya/common/s3.py`) stages every model file through `tempfile.TemporaryDirectory()` before moving it to the final cache location. `TemporaryDirectory()` honours `$TMPDIR` and falls back to `/tmp`. The `text_recognition` model is 1.34 GB; future Surya models will be in the same range. This blows the 512 MB budget at ~510 MB with `OSError: [Errno 28] No space left on device`.
|
||||
|
||||
The host has 1.8 TB free on the disk that backs `/app/cache`. The failure is a routing problem, not a capacity problem.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
Set `TMPDIR=/app/cache/.tmp` in the OCR container so all `tempfile` staging goes to the persistent SSD-backed cache volume.
|
||||
|
||||
```yaml
|
||||
# docker-compose.yml / docker-compose.prod.yml — ocr-service.environment
|
||||
TMPDIR: /app/cache/.tmp
|
||||
```
|
||||
|
||||
```dockerfile
|
||||
# ocr-service/Dockerfile — default for bare docker-run usage
|
||||
ENV TMPDIR=/app/cache/.tmp
|
||||
```
|
||||
|
||||
```bash
|
||||
# ocr-service/entrypoint.sh — idempotent directory bootstrap
|
||||
mkdir -p "${TMPDIR:-/tmp}"
|
||||
find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete 2>/dev/null || true
|
||||
```
|
||||
|
||||
A one-shot `ocr-volume-init` service in both compose files runs before `ocr-service` to `chown -R 1000:1000` the volumes and `mkdir -p /app/cache/.tmp`. This replaces the manual `docker run --rm alpine chown` step performed on 2026-05-18 and makes fresh-volume correctness a permanent infrastructure-as-code guarantee.
|
||||
|
||||
The `/tmp` tmpfs remains at 512 MB and continues to serve training-ZIP extraction and transient PDF buffers — its original purpose.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**
|
||||
|
||||
- Surya model downloads complete: 1.34 GB fits on the SSD, not in 512 MB of RAM.
|
||||
- `shutil.move()` from staging → cache becomes a same-filesystem `rename(2)` — atomic and near-free.
|
||||
- Volume ownership is now automated; no manual `docker run --rm alpine chown` on redeploy.
|
||||
- `/tmp` retains its small 512 MB DoS cap for attacker-influenceable training endpoints (post-auth only, behind `X-Training-Token`).
|
||||
- ZIP Slip protection in `_validate_zip_entry()` is unaffected — it uses `os.path.realpath()` anchored to the extraction directory regardless of where that directory lives.
|
||||
|
||||
**Negative / Trade-offs**
|
||||
|
||||
- If the container is `docker kill`ed mid-download, partial files persist in `/app/cache/.tmp` across container restarts. Mitigated by the `find -mtime +1 -delete` in `entrypoint.sh` — orphans older than one day are removed on startup.
|
||||
- `TMPDIR` pointing inside a volume mount is non-obvious. Any future move of `/app/cache` to a different storage tier must revisit this setting. This ADR is the load-bearing reference.
|
||||
|
||||
---
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
**Approach B — Enlarge `/tmp` to 4 GB**
|
||||
One-line change. Discarded because: (1) 4 GB tmpfs counts against the cgroup `mem_limit`; on CX32 hosts with `OCR_MEM_LIMIT=6g` the combined Surya resident set + tmpfs would trigger OOMKill on cold start; (2) staging GB-scale model files through RAM is using the wrong storage tier; (3) any future model larger than 4 GB requires another bump.
|
||||
|
||||
**Approach C — Both TMPDIR redirect and enlarged /tmp**
|
||||
Belt-and-suspenders: Approach A + 1 GB tmpfs. Discarded in favour of the cleaner Approach A. The defence-in-depth benefit does not outweigh the extra compose churn; the 512 MB cap on `/tmp` is intentional.
|
||||
Reference in New Issue
Block a user